First, import the modules we will need. Multiple threads are used here to process each individual query.

In [1]:
from threading import Thread, Lock
import sys
import csv
import re

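To pick out the Chinese and English parts of the news headlines, we define regular expressions,
and `THREAD_NUM` sets how many threads will be used.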

In [2]:
%%time
RE_CJK = re.compile(r'[\u4e00-\ufaff]+', re.UNICODE)
RE_ENG = re.compile(r'[a-zA-Z]+')
RE_ALL = re.compile(r'[a-zA-Z\u4e00-\ufaff]+', re.UNICODE)

THREAD_NUM = 2

CPU times: user 4.54 ms, sys: 106 µs, total: 4.65 ms
Wall time: 4.62 ms
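As a quick check of how these patterns behave, the sketch below runs `findall` on a made-up headline. The sample string is an assumption for illustration and is not taken from the data set.

```python
import re

# Same patterns as in the cell above
RE_CJK = re.compile(r'[\u4e00-\ufaff]+')
RE_ENG = re.compile(r'[a-zA-Z]+')

# Hypothetical headline mixing Chinese, English, digits and punctuation
headline = '蘋果發表iPhone 15,股價上漲'

# findall returns each maximal run of matching characters;
# anything outside the character class acts as a break between runs
cjk_runs = RE_CJK.findall(headline)
eng_runs = RE_ENG.findall(headline)

print(cjk_runs)  # ['蘋果發表', '股價上漲']
print(eng_runs)  # ['iPhone']
```

Digits, spaces and the ASCII comma match neither pattern, so they only serve as separators.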


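Define a function `index_parallel` that first uses the regular expressions to extract the terms we need from each headline (runs of Chinese characters or English words).
Here the `findall` method returns a list of matching runs, split wherever a character falls outside the defined pattern.

Because a Chinese query is only ever a 2-gram or a 3-gram, each extracted Chinese run is broken into 2-grams and 3-grams and stored in a set (English words are stored whole). The set is then appended to the designated slot of the index.

Because we process the data with multiple threads, `THREAD_NUM` indexes are built.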

In [3]:
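def index_parallel(source, start, step, index):
    # Each worker handles rows start, start + step, start + 2 * step, ...
    for i in range(start, len(source), step):
        title = source[i][1]  # the headline is read from the second CSV column
        cjk_strings = RE_CJK.findall(title)
        eng_strings = RE_ENG.findall(title)
        split_string = set()
        for string in cjk_strings:
            # Keep only full-length 2-grams and 3-grams of each Chinese run
            split_string.update(string[j:j+2] for j in range(len(string) - 1))
            split_string.update(string[j:j+3] for j in range(len(string) - 2))
        # English words are stored whole (updating with an empty list is a no-op)
        split_string.update(eng_strings)
        # Worker `start` owns sub-index `start`, so no two threads share a list
        index[start].append(split_string)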
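To make the n-gram step concrete, here is a minimal sketch of splitting one Chinese run into 2-grams and 3-grams. The helper name `ngrams` and the sample run are assumptions for illustration; the helper emits only full-length n-grams.

```python
def ngrams(text, n):
    """Return the set of all length-n substrings of text (hypothetical helper)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

run = '股價上漲'  # made-up Chinese run, as RE_CJK.findall might return

bigrams = ngrams(run, 2)   # {'股價', '價上', '上漲'}
trigrams = ngrams(run, 3)  # {'股價上', '價上漲'}

# A 2-gram or 3-gram query can then be answered with a set membership test
print('價上' in bigrams)     # True
print('股價上' in trigrams)  # True
```

Storing the n-grams in a set trades memory for constant-time lookups, which is what lets each query term be checked without rescanning the headline text.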

The `split_line` function reads the file and runs `index_parallel` in multiple threads.
- The number of `list()` objects in the index structure depends on `THREAD_NUM`

In [4]:
def split_line(filename):
    # Read every CSV row up front so the workers can access rows by position
    with open(filename, 'r', newline='') as csvfile:
        sourcereader = list(csv.reader(csvfile, delimiter=','))
    # One sub-index per thread, rather than a hard-coded pair of lists
    index = [list() for _ in range(THREAD_NUM)]
    threads = list()
    for i in range(0, THREAD_NUM):
        s = Thread(target=index_parallel, args=(sourcereader, i, THREAD_NUM, index))
        s.start()
        threads.append(s)
    for thread in threads:
        thread.join()

    return index
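The way rows are shared out among the workers can be sketched without any threads. The snippet below uses placeholder rows (an assumption) to show the round-robin split and how a position inside a sub-index maps back to a 1-based line number, which is the same arithmetic the search functions rely on.

```python
THREAD_NUM = 2  # same value as in the notebook

rows = ['row0', 'row1', 'row2', 'row3', 'row4']  # placeholder rows

# Worker k handles rows k, k + THREAD_NUM, k + 2 * THREAD_NUM, ...
buckets = [rows[k::THREAD_NUM] for k in range(THREAD_NUM)]
print(buckets)  # [['row0', 'row2', 'row4'], ['row1', 'row3']]

# Position `pos` in bucket `k` corresponds to 1-based line pos * THREAD_NUM + k + 1
line_numbers = [
    [pos * THREAD_NUM + k + 1 for pos in range(len(bucket))]
    for k, bucket in enumerate(buckets)
]
print(line_numbers)  # [[1, 3, 5], [2, 4]]
```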

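Because each news headline is stored as a set, the logic for each of the three query types is split into its own function to keep the code concise and maintainable.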

In [5]:
def or_search(print_list, queries, index_string, thread_index):
    for (i, search_line) in enumerate(index_string):
        if queries & search_line:
            print_list.append(i*THREAD_NUM+thread_index+1)


def and_search(print_list, queries, index_string, thread_index):
    for (i, search_line) in enumerate(index_string):
        # <= is the subset test; < (proper subset) would miss a title whose terms equal the query exactly
        if queries <= search_line:
            print_list.append(i*THREAD_NUM+thread_index+1)


def not_search(print_list, in_element, notin_element, index_string, thread_index):
    for (i, search_line) in enumerate(index_string):
        # A title is skipped only when every excluded term is present (<= is the subset test)
        if (in_element in search_line
            and not notin_element <= search_line):
            print_list.append(i*THREAD_NUM+thread_index+1)
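The set operators these functions depend on can be tried on a single made-up title. The term sets below are assumptions for illustration; note in particular that `<` is the proper-subset test while `<=` is the plain subset test.

```python
# Hypothetical indexed terms for one headline
title_terms = {'股價', '價上', '上漲', 'iPhone'}

# OR: a non-empty intersection means at least one query term occurs
or_hit = bool({'上漲', '下跌'} & title_terms)

# AND: every query term must occur, i.e. the query is a subset of the title
and_hit = {'股價', '上漲'} <= title_terms
and_miss = {'股價', '下跌'} <= title_terms

# Proper subset (<) is False when both sets are equal; subset (<=) is True
strict = {'股價'} < {'股價'}
loose = {'股價'} <= {'股價'}

print(or_hit, and_hit, and_miss)  # True True False
print(strict, loose)              # False True
```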

Run the computation only when this is executed as the main script rather than imported as a module.

In [6]:
%%time
if __name__ == '__main__':

    import argparse
    # python main.py --source source.csv --query query.txt --output output.txt
    parser = argparse.ArgumentParser()
    parser.add_argument('--source',
                        default='source.csv',
                        help='input source data file name')
    parser.add_argument('--query',
                        default='query.txt',
                        help='query file name')
    parser.add_argument('--output',
                        default='output.txt',
                        help='output file name')
    # args = parser.parse_args()
    args = parser.parse_args(['--query', 'query.1.txt'])

    index_string = split_line(args.source)
    with open(args.output, 'w') as o, open(args.query, 'r') as q:
        for query_line in q:
            query_line = query_line.rstrip('\n')
            print_list = list()
            threads = list()
            if ' or ' in query_line:
                queries = set(re.split(' or ', query_line))
                for i in range(0, THREAD_NUM):
                    s = Thread(target=or_search,
                                args=(print_list, queries, index_string[i], i))
                    s.start()
                    threads.append(s)

            elif ' and ' in query_line:
                queries = set(re.split(' and ', query_line))
                for i in range(0, THREAD_NUM):
                    s = Thread(target=and_search,
                                args=(print_list, queries, index_string[i], i))
                    s.start()
                    threads.append(s)
            elif ' not ' in query_line:
                queries = re.split(' not ', query_line)
                in_element = queries[0]
                notin_element = set(queries[1:])
                for i in range(0, THREAD_NUM):
                    s = Thread(target=not_search,
                                args=(print_list, in_element, notin_element, index_string[i], i))
                    s.start()
                    threads.append(s)

            for thread in threads:
                thread.join()

            if not print_list:
                print('0', file=o)
            else:
                print_list.sort()
                print(','.join(map(str, print_list)), file=o)


CPU times: user 2min 10s, sys: 38.6 s, total: 2min 49s
Wall time: 3min 5s
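The query-dispatch step can also be isolated into a small function, which makes it easier to test. This is a minimal sketch, assuming each query line contains at most one kind of operator, written in lowercase and surrounded by single spaces; the name `parse_query` is hypothetical and not part of the original script.

```python
def parse_query(line):
    """Split a query line into (operator, terms); operator is None for a bare term."""
    line = line.rstrip('\n')
    for op in ('or', 'and', 'not'):
        sep = f' {op} '
        # Match the operator only as a space-delimited word, so that
        # terms such as 'order' or 'android' are not mistaken for operators
        if sep in line:
            return op, line.split(sep)
    return None, [line]

print(parse_query('股價 and 上漲\n'))   # ('and', ['股價', '上漲'])
print(parse_query('order not 股價'))    # ('not', ['order', '股價'])
print(parse_query('world'))             # (None, ['world'])
```

Checking for the space-delimited separator rather than the bare substring is the important design choice here: a bare `'or' in line` test would also fire on any English term that happens to contain those letters.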
