## Data cleaning: segmenting run-together words

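#### Notes
##### The worker_functions.ipynb file must sit in the same project folder as this notebook.
##### Set the number of processes according to how many CPU cores your machine has; using more than that may overload the CPU.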
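
One way to follow the note above is to derive the worker count from the machine instead of hard-coding it. The helper below is a minimal sketch; `choose_max_workers` and its `reserve` parameter are hypothetical names, not part of this notebook.

```python
import os


def choose_max_workers(requested=None, reserve=1):
    """Return a worker count capped at the logical CPU count minus `reserve`.

    Hypothetical helper: `reserve` cores are left free for the main process
    and the rest of the system.
    """
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    limit = max(1, available - reserve)
    if requested is None:
        return limit
    # Never go below one worker or above the computed limit.
    return max(1, min(requested, limit))
```

Calling `choose_max_workers(30)` on a machine with fewer logical cores than that returns the smaller, safer value.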

In [4]:
import os
import sys
import json
import wordninja
import traceback
import wordsegment
import pandas as pd
import multiprocessing
from tqdm import tqdm
from wordsegment import load, segment
from ipynb.fs.full.worker_functions import initializer, parallel_segment, parallel_segment_with_progress
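
The contents of worker_functions.ipynb are not shown in this notebook. The sketch below is only a guess at what the three imported functions might look like, inferred from how they are called here: the bodies and the optional `segment_fn` parameter are assumptions, and the real file may differ.

```python
# Hypothetical contents of worker_functions.ipynb (an assumption, not the original file).
_segment = None  # per-process segmenter, set by initializer()


def initializer():
    """Run once in each worker process: load wordsegment's word-frequency data."""
    global _segment
    from wordsegment import load, segment  # imported in the worker itself
    load()
    _segment = segment


def parallel_segment(article_id, content, segment_fn=None):
    """Segment one article and return (article_id, segmented_text)."""
    fn = segment_fn or _segment
    if fn is None:
        raise RuntimeError("initializer() has not run in this process")
    return article_id, ' '.join(fn(content))


def parallel_segment_with_progress(article_id, content, progress_queue, segment_fn=None):
    """Same as parallel_segment, but report completion on progress_queue."""
    try:
        return parallel_segment(article_id, content, segment_fn)
    finally:
        # Signal progress even when segmentation fails, so a main process
        # that counts queue items does not wait forever on a failed task.
        progress_queue.put(1)
```

Keeping these functions in a separate importable module matters for multiprocessing: on platforms that spawn worker processes, such as Windows, each worker must be able to import the functions it runs.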

### Testing which method works best on English text with no spaces between words

In [5]:
# Load the test data
test_path = r'D:\zhenfeng zhou\The Civil War\analyse_code\测试解决连接词包的效能.txt'
with open(test_path,'r',encoding = 'utf-8') as file:
    test = file.read()
test

'Inthestartingpointofthejourney,thereweremanyunexpectedturnsandtwists.Themaincharactersettledinthecountrysidewheretheskywasblueandthegrasswasgreenthroughouttheyear.Eventhoughtheyfacednumerouschallenges,theyremainedoptimisticandperseveredthroughthickandthin.Friendshipandloyaltywerethecornerstonesoftheirrelationships.\n\nInthequietvillage,therewerestoriesofancienttimeswherelegendsofthepastwerebroughttolifeagainandagain.Eachstorywasafascinatingtaleofadventureandmysterythatcaptivatedtheyoungandoldalike.Thevillagersgatheredaroundthecampfireeachnighttotelltheseancienttales,creatingabondthatwaseverlasting.\n\nOneparticularlyinterestingstorywasaboutthebravewarriorwhosetoutonamissiontosavehishomelandfrominvaders.Hisjourneytookhimthroughdarkforests,overhighmountains,andacrosstreacherousrivers.Hefacedmanydangersbutwithcourageanddetermination,heovercameallobstacles.\n\nAshetravelled,hemetkindstrangerswhoofferedhimguidanceandsupport.Theiradviceprovedinvaluableashefacedhismostchallengingtaskyet\n.Th

##### wordninja (a tool for splitting concatenated English words)

In [9]:
split_words = wordninja.split(test)

text = ' '.join(split_words)

text

'In the starting point of the journey there were many unexpected turns and twists The main character settled in the countryside where the sky was blue and the grass was green throughout the year Eventhough they faced numerous challenges they remained optimistic and persevered through thick and thin Friendship and loyalty were the cornerstones of their relationships In the quiet village there were stories of ancient times where legends of the past were brought to life again and again Each story was a fascinating tale of adventure and mystery that captivated the young and old a like The villagers gathered around the campfire each night to tell these ancient tales creating a bond that was everlasting One particularly interesting story was about the brave warrior who set out on a mission to save his homeland from invaders His journey took him through dark forests over high mountains and across treacherous rivers He faced many dangers but with courage and determination he overcame all obsta

##### wordsegment (for processing and segmenting unsegmented English text)

In [7]:
load()

text = ' '.join(segment(test))

text

'in the starting point of the journey there were many unexpected turns and twists the main character settled in the countryside where the sky was blue and the grass was green throughout the year even though they faced numerous challenges they remained optimistic and persevered through thick and thin friendship and loyalty were the cornerstones of their relationships in the quiet village there were stories of ancient times where legends of the past were brought to life again and again each story was a fascinating tale of adventure and mystery that captivated the young and old alike the villagers gathered around the campfire each night to tell these ancient tales creating a bond that was everlasting one particularly interesting story was about the brave warrior who set out on a mission to save his homeland from invaders his journey took him through dark forests over high mountains and across treacherous rivers he faced many dangers but with courage and determination he overcame all obsta
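
To see where the two outputs disagree without reading them side by side, the token sequences can be diffed. This is a minimal sketch using the standard-library `difflib`; the function name `token_disagreements` is hypothetical. Both texts are lowercased first, because wordsegment lowercases its output while wordninja keeps the original case.

```python
import difflib


def token_disagreements(text_a, text_b):
    """Return (a_fragment, b_fragment) pairs where two segmentations differ."""
    a = text_a.lower().split()
    b = text_b.lower().split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (' '.join(a[i1:i2]), ' '.join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != 'equal'
    ]


# Short excerpts taken from the two outputs above.
ninja = "the year Eventhough they faced young and old a like The villagers"
seg = "the year even though they faced young and old alike the villagers"
print(token_disagreements(ninja, seg))
```

On these excerpts the differences are `eventhough` versus `even though` and `a like` versus `alike`.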

#### Large language models (in testing, this approach needed extra steps such as part-of-speech tagging and proved cumbersome.)

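#### Conclusion, based on an evaluation with ChatGPT-4
wordninja runs fast, but in the sample above it leaves "Eventhough" unsplit and breaks "alike" into "a like".
wordsegment segments more accurately, reaching over 95% accuracy in that evaluation; note that it lowercases its output, and both tools drop punctuation.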
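
The accuracy figure above comes from a ChatGPT-4 evaluation. If a correctly spaced reference text is available, a reproducible alternative is to score the predicted word boundaries against it. The sketch below is one possible metric (boundary precision, recall, and F1), not the evaluation used in this notebook; the function names are hypothetical, and both token lists are assumed to be lowercased and stripped of punctuation beforehand so that they spell the same string.

```python
def boundary_positions(tokens):
    """Character offsets in the unspaced string where a word boundary falls."""
    positions, offset = set(), 0
    for tok in tokens[:-1]:  # no boundary after the last token
        offset += len(tok)
        positions.add(offset)
    return positions


def boundary_scores(predicted_tokens, reference_tokens):
    """Return (precision, recall, f1) of predicted boundaries vs. the reference."""
    if ''.join(predicted_tokens) != ''.join(reference_tokens):
        raise ValueError("token lists must spell the same unspaced string")
    pred = boundary_positions(predicted_tokens)
    ref = boundary_positions(reference_tokens)
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 1.0
    recall = hits / len(ref) if ref else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For example, scoring `'young and old a like'.split()` against `'young and old alike'.split()` gives a precision of 0.75 and a recall of 1.0, because one of the four predicted boundaries is spurious.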

### Using wordsegment for word segmentation

In [2]:
def process_data(id_content, max_workers):
    final_result = []
    try:
        manager = multiprocessing.Manager()
        progress_queue = manager.Queue()
        with multiprocessing.Pool(processes=max_workers, initializer=initializer) as pool:
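            # Submit one task per article with pool.apply_async, passing progress_queue along
            results = [
                pool.apply_async(parallel_segment_with_progress, args=(article_id, content, progress_queue))
                for article_id, content in id_content
            ]

            # In the main process, advance the progress bar once for each item taken off the queue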
            with tqdm(total=len(id_content)) as pbar:
                for _ in range(len(id_content)):
                    progress_queue.get()
                    pbar.update(1)

            pool.close()
            pool.join()

        # Collect the actual results from the AsyncResult objects
        for res in tqdm(results):
            final_result.append(res.get())
            
        return final_result
                      
    except Exception as e:
        print(f"Error in process_data: {e}")
        print(traceback.format_exc())
        sys.exit(1) 

In [3]:
if __name__ == '__main__':
    try:
        # Load the data
        file_path = r'D:\zhenfeng zhou\The Civil War\data\1850_1875_all_article.csv'
        df = pd.read_csv(file_path)
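        # Replace missing bodies with '' first; astype(str) alone would turn NaN into the literal 'nan'
        df['Article_body'] = df['Article_body'].fillna('').astype(str)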

        id_content = list(zip(df['Article_id'], df['Article_body']))
        max_workers = 30

        final_result = process_data(id_content, max_workers) 
        df_seg = pd.DataFrame(final_result, columns=['Article_id', 'seg_article_body'])
        df_seg.to_csv(r'D:\zhenfeng zhou\The Civil War\data\1850_1875_all_article_seg.csv',index = False)
        
    except Exception as e:
        print(e)
        print(traceback.format_exc())
        sys.exit(1) 

100%|██████████████████████████████████████████████████████████████████| 11122491/11122491 [214:59:19<00:00, 14.37it/s]
100%|█████████████████████████████████████████████████████████████████▊| 11080573/11122491 [00:31<00:00, 529350.32it/s]