In [2]:
import re
from tqdm import tqdm
import pickle
from functools import reduce
import mafan
from mafan import text
import itertools
bos = " <bos> "
eos = " <eos> "

# Tokenizer Functions

## Sentence Tokenizer

In [3]:
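def zng(paragraph):
    '''Yields the sentences of a paragraph, each with its closing punctuation mark.'''
    # raw string, so the backslash escapes reach the regex engine untouched
    for sent in re.findall(r'[^!?。\.\!\?]+[!?。\.\!\?]?', paragraph, flags=re.U):
        yield sent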
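
As a quick check of the sentence splitter, here is a minimal, self-contained sketch. It restates the same pattern under the hypothetical name `split_sentences` so that it runs on its own; the sample paragraph is made up for illustration.

```python
import re

# Same pattern as zng, restated so this snippet runs on its own.
SENTENCE_RE = re.compile(r'[^!?。\.\!\?]+[!?。\.\!\?]?')

def split_sentences(paragraph):
    """Yield each run of text together with its closing punctuation mark, if any."""
    for sent in SENTENCE_RE.findall(paragraph):
        yield sent

sample = "你好。今天好吗?很好"
print(list(split_sentences(sample)))
```

A trailing fragment with no closing punctuation is still returned as its own sentence.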

## Simplified Chinese Tokenizer

The code below builds the simplified-to-traditional mapping dictionary.

We have a large dictionary file, *conversions.txt*, that covers words, characters, common phrases, locations and idioms. Each entry contains a traditional Chinese word followed by its simplified Chinese equivalent.

In [4]:
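s2t_dict = dict()

# each line holds a traditional form followed by its simplified form
with open("conversions.txt", "r", encoding="utf-8") as infile:
    for line in infile:
        arr = line.split()
        if len(arr) < 2:
            continue # skip blank or malformed lines
        trad = arr[0]
        sim = arr[1]
        # one simplified form can map to several traditional forms
        s2t_dict.setdefault(sim, []).append(trad)
# "-" is the placeholder that prepare() uses for non-chinese characters
s2t_dict['-'] = ['-']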
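
To make the shape of the mapping concrete, here is a minimal sketch that runs the same loading logic on a three-line in-memory sample. The sample entries and the name `s2t_sample` are illustrative assumptions, not content taken from *conversions.txt*.

```python
import io

# Hypothetical sample in the "traditional simplified" layout described above.
sample_file = io.StringIO("發 发\n髮 发\n頭髮 头发\n")

s2t_sample = {}
for line in sample_file:
    fields = line.split()
    if len(fields) < 2:
        continue  # ignore blank or malformed lines
    trad, sim = fields[0], fields[1]
    # Group every traditional form under its simplified key.
    s2t_sample.setdefault(sim, []).append(trad)

print(s2t_sample)
```

The simplified character 发 ends up with two traditional candidates, which is exactly the ambiguity the language model has to resolve later.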

The tokenizer identifies dictionary words and phrases in the input sentence. We always prefer longer phrases because they carry more meaning and have fewer candidate translations. Hence we use a greedy longest-match scheme in the spirit of Byte Pair Encoding (BPE), with the candidates constrained to the vocabulary defined in the dictionary. Since the longest phrase in the dictionary has 8 characters, we start with 8-character phrases and work down to single characters.

In [5]:
def tokenizer(sentence, n = 8):
    '''
    This function tokenizes an input sentence according to the dictionary.
    Input: a sentence or paragraph
    Output: a list of tokens in the order they appear in the original paragraph; a list of the non-chinese characters from the original text.
    '''
    text, charList = prepare(sentence)
    token_list = []
    input_text = text
    for k in range(n, 0, -1):
        candidates = [input_text[i:i + k] for i in range(len(input_text) - k + 1)]
        for candidate in candidates:
            if candidate in s2t_dict:
                token_list.append(candidate)
                input_text = input_text.replace(candidate, '') # literal removal; re.sub would read the token as a regex pattern
    final = sequencer(token_list, text)
    return final, charList
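
The longest-match idea can be sketched in one self-contained function. This is a simplified illustration, not the notebook's exact procedure: it folds the overlap check into the scan instead of using a separate sequencing step, and `longest_match_tokenize` and `toy_vocab` are hypothetical names.

```python
def longest_match_tokenize(text, vocab, max_len=8):
    """Claim dictionary phrases from longest to shortest, never overlapping."""
    taken = [False] * len(text)  # True once a character belongs to a token
    spans = []
    for k in range(max_len, 0, -1):
        for i in range(len(text) - k + 1):
            piece = text[i:i + k]
            if piece in vocab and not any(taken[i:i + k]):
                spans.append((i, piece))
                for j in range(i, i + k):
                    taken[j] = True
    spans.sort()  # restore left-to-right order
    return [piece for _, piece in spans]

toy_vocab = {"我", "去", "理", "发", "理发"}
print(longest_match_tokenize("我去理发", toy_vocab))
```

Because two-character phrases are claimed before single characters, 理发 stays together instead of being split into 理 and 发. Characters that match no dictionary entry are dropped by this sketch.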

In [6]:
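def output_list(sentence_list, char_list):
    '''
    Puts the non-chinese characters back into the candidate lists.
    Input: list of candidate lists; list of non-chinese characters from prepare
    Output: candidate lists in which each "-" placeholder is replaced by its original character
    '''
    count = 0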
    original = [] # sentence we want to output
    
    for word in sentence_list:
        if "-" in word:
            original.append(list(char_list[count]))
            count += 1
        else:
            original.append(word)
    return original

In [7]:
def output(sentence, char_list):
    count = 0
    original = "" # sentence we want to output

    for char in list(sentence):
        if char == "-":
            original += char_list[count] # append character if non-chinese
            count += 1
        else:
            original += char # append chinese
    return original

In [8]:
def prepare(sentence):
    new = "" # input to your tokenizer
    char_list = [] # punct / english to be omitted

    for char in list(sentence):
        if text.identify(char) is mafan.NEITHER:
            new += "-" # replace each non-chinese char with "-"
            char_list.append(char)
        else:
            new += char

    return new, char_list
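
The masking round trip performed by `prepare` and `output` can be illustrated without the `mafan` dependency. In this sketch the helper `is_chinese` is an assumption that stands in for `text.identify`: it only checks the CJK Unified Ideographs block, which is cruder than what `mafan` does. The names `mask` and `unmask` are likewise hypothetical.

```python
def is_chinese(char):
    # Assumption: treat the CJK Unified Ideographs block as "Chinese".
    return '\u4e00' <= char <= '\u9fff'

def mask(sentence):
    """Replace every non-Chinese character with '-' and remember it."""
    masked, removed = "", []
    for char in sentence:
        if is_chinese(char):
            masked += char
        else:
            masked += "-"
            removed.append(char)
    return masked, removed

def unmask(masked, removed):
    """Put the remembered characters back in place of the '-' placeholders."""
    chars = iter(removed)
    return "".join(next(chars) if c == "-" else c for c in masked)

masked, removed = mask("我用Python。")
print(masked, removed)
print(unmask(masked, removed))
```

A literal "-" in the input is itself masked and stored, so the round trip stays lossless.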

In [9]:
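def sequencer(tokens, example):
    '''
    Orders the tokens by their position in the prepared text.
    Input: the tokens found by the tokenizer; the prepared text
    Output: the tokens in text order, with no two tokens overlapping
    '''
    flags = [1] * len(example) # 1 = character not yet claimed by a token
    sequence = []
    for token in tokens:
        # escape the token so that it is matched literally
        for match in re.finditer(re.escape(token), example):
            location = (token, match.span()[0], match.span()[1])
            valid = all(flags[location[1]:location[2]])
            if valid:
                sequence.append(location)
                for i in range(location[1], location[2]):
                    flags[i] = 0
    sequence.sort(key=lambda x: x[1])
    result = [x[0] for x in sequence]
    return result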

## Corpus Preparation

First, we need to prepare our corpus.
1. Add padding (sentinels) to each sentence.
2. Take one sentence at a time.
3. Replace non-Chinese (alphanumeric) words with FW so that the vocabulary does not explode.
4. Slice out the n-grams and add them to the dictionary.
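
The padding and slicing steps can be sketched on a toy sentence. This minimal example assumes the sentence is already split into words; `padded_ngrams` is a hypothetical helper that mirrors the slice bounds used in the notebook.

```python
from collections import Counter

BOS, EOS = "<bos>", "<eos>"

def padded_ngrams(words, order):
    """Pad with `order` sentinels on each side, then slice into n-grams."""
    padded = [BOS] * order + list(words) + [EOS] * order
    # Start at 1 and stop early: this skips the all-padding n-gram at each end.
    return [tuple(padded[i:i + order]) for i in range(1, len(padded) - order)]

counts = Counter(padded_ngrams(["我", "愛", "香港"], 2))
for ngram, freq in counts.items():
    print(ngram, freq)
```

Tuples are used as keys because lists are not hashable, which is why each n-gram is converted before it is counted.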

In [10]:
def add_stuff(order):
    '''
    This function divides the corpus into n-grams and stores them in dictionary.
    Input: order of n-gram (like 2 for bi-gram)
    Output: none
    '''
    infile = open("hk-zh.txt", "r", encoding="utf-8") # this contains our corpus (assumed to be UTF-8, like conversions.txt)
    start_padding = bos * order # add padding
    end_padding = eos * order

    for line in tqdm(infile, total=1314726):
        line = line.rstrip()
        sentences = list(zng(line)) # tokenize sentence by sentence
        for sentence in sentences:
            candidate = start_padding + sentence + end_padding # form sentence
            word_list = candidate.split()
            word_list_tokens = []
            for word in word_list:
                if not(bool(re.match('^[a-zA-Z0-9]+$', word))):
                    word_list_tokens.append(word) # keep anything that is not purely alphanumeric (chinese, punctuation, padding)
                else:
                    word_list_tokens.append("FW") # turn alphanumeric (non-chinese) words into FW
            word_list = word_list_tokens
            ordered = [word_list[i:i + order] for i in range(1, len(word_list) - order)] # extract n-grams through slicing
            # for each ngram, convert to tuple and add to dictionary
            for ngram in ordered:
                ngram = tuple(ngram)
                if ngram not in corpus:
                    corpus[ngram] = 1
                else:
                    corpus[ngram] += 1

Say you want n-grams up to trigrams.

That takes 3 iterations: one for trigrams, one for bigrams and one for unigrams. Each iteration takes about 2 minutes. This is the only time-consuming part of the code. Once the dictionary is built, you don't need to do this again.

In [11]:
corpus = dict()
# start_order = 2
# for i in range(start_order, 0, -1):
#     add_stuff(i)

Once you have built the dictionary, dump it into a pickle.

In [12]:
# import pickle
# with open('corpus.pkl', 'wb') as handle:
#     pickle.dump(corpus, handle)

Here's how to load the pickle so you don't need to process the data every time.

In [13]:
with open('corpus.pkl', 'rb') as fp:
    corpus = pickle.load(fp)
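
For readers who want to try the dump-and-load cycle without the full corpus, here is a minimal round trip on a toy dictionary. The file name `toy_corpus.pkl` and the counts are made up for illustration.

```python
import os
import pickle
import tempfile

# Toy n-gram counts with tuple keys, the same shape as the real corpus dict.
toy_corpus = {("香港",): 3, ("香港", "人"): 2}

path = os.path.join(tempfile.mkdtemp(), "toy_corpus.pkl")
with open(path, "wb") as handle:
    pickle.dump(toy_corpus, handle)

with open(path, "rb") as handle:
    restored = pickle.load(handle)

print(restored == toy_corpus)
```

Tuple keys survive pickling unchanged, so lookups work the same way after loading.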

# Making Candidate Lists

1. Tokenize the input.
2. Check the mappings of each input.
3. Add all possible mappings to candidate list.

In [14]:
def convert(sentence):
    '''
    Returns the list of possible mappings.
    Input: Simplified Chinese sentence
    Output: List of lists. Each inner list holds the possible traditional Chinese forms of one token
    '''
    tokens, char_list = tokenizer(sentence)
    candidate_list = []
    for token in tokens:
        candidate_list.append(s2t_dict[token])
    candidate_list = output_list(candidate_list, char_list)
    return candidate_list
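
To show what a candidate list looks like, here is a minimal sketch with a toy mapping. The dictionary `toy_s2t` and the helper `candidates_for` are hypothetical, and the input is assumed to be tokenized already.

```python
from math import prod

# Toy simplified-to-traditional mapping; some tokens are ambiguous.
toy_s2t = {"头发": ["頭髮"], "干": ["乾", "幹", "干"], "了": ["了", "瞭"]}

def candidates_for(tokens, mapping):
    """Look up every token and collect its possible traditional forms."""
    return [mapping[token] for token in tokens]

cands = candidates_for(["头发", "干", "了"], toy_s2t)
print(cands)

# Number of full-sentence combinations if every choice were enumerated.
n_combos = prod(len(options) for options in cands)
print(n_combos)
```

The number of combinations multiplies with every ambiguous token, which is one reason to pick candidates one token at a time rather than scoring every full sentence.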

# Maximum log-likelihood calculations

In [15]:
from math import log # prob uses log, so import it before the function is defined

num_tokens = 4526000 # total number of tokens in corpus

def prob(word_list):
    '''
    Computes the log probability of an n-gram.
    Input: A sequence of words in the form of a list
    Output: Log probability
    '''
    word_list = tuple(word_list) # change word list to tuple
    if word_list in corpus:
        # word found in dictionary
        numerator = corpus[word_list] # get the frequency of that word list
        denominator = num_tokens # let denominator be num tokens
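        # drop the last word; if that history is in the corpus, use its count as the denominator
        # (the check is on the full n-gram length so that bigrams are conditioned on their unigram history)
        if len(word_list) > 1 and word_list[:-1] in corpus: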
            denom_list = word_list[:-1]
            denominator = corpus[denom_list]
        return log(numerator / denominator) # log of prob
    else:
        word_list = list(word_list) # convert it back to list
        k = len(word_list) - 1 # backoff, reduce n gram length
        if k > 0:
            # recursive function, divide the sequence into smaller n and find probs
            probs = [prob(word_list[i:i + k]) for i in range(len(word_list) - k + 1)]
            return sum(probs)
        else:
            # we found an unseen word
            if not(bool(re.match('^[a-zA-Z0-9]+$', word_list[0]))):
                return log(1 / num_tokens) # return a small probability
            else:
                return prob(["FW"]) # we encountered a non-chinese word
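
The estimate can be traced by hand on a toy count table. The sketch below is an illustration under stated assumptions, not the notebook's code: it conditions a bigram on its one-word history (the textbook maximum-likelihood estimate), it omits the FW handling, and `log_prob`, `toy_counts` and `toy_total` are hypothetical names.

```python
import math

toy_counts = {("香",): 4, ("港",): 3, ("香", "港"): 2}
toy_total = 10  # assumed total number of tokens

def log_prob(ngram, counts, total):
    """Log MLE of an n-gram, backing off to shorter n-grams when unseen."""
    ngram = tuple(ngram)
    if ngram in counts:
        history = ngram[:-1]
        # Condition on the history if it was seen; otherwise use the token total.
        denom = counts[history] if history and history in counts else total
        return math.log(counts[ngram] / denom)
    if len(ngram) == 1:
        return math.log(1 / total)  # unseen word: small fixed probability
    k = len(ngram) - 1
    # Back off: sum the log-probabilities of all shorter n-grams.
    return sum(log_prob(ngram[i:i + k], counts, total)
               for i in range(len(ngram) - k + 1))

print(log_prob(("香", "港"), toy_counts, toy_total))  # seen bigram
print(log_prob(("港", "香"), toy_counts, toy_total))  # unseen bigram, backs off
```

The backoff here simply adds the lower-order log-probabilities with no discounting, so it is a scoring heuristic rather than a normalized probability model.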

# Backoff Language Model

In [16]:
from math import log
def backoff(sentence, order):
    '''
    Calculates the log likelihood using the backoff language model
    Input: Sentence and order of the n-gram
    Output: Log prob of that sentence
    '''
    score = 0
    sentences = list(zng(sentence)) # sentence tokenizer
    for sentence in sentences:
        start_padding = bos * order # beginning padding
        end_padding = eos * order # ending padding
        candidate = start_padding + sentence + end_padding # add paddings
        word_list = candidate.split()
        word_list_tokens = []
        for word in word_list:
            # keep chinese words, punctuation and padding; replace alphanumeric words with FW
            if not(bool(re.match('^[a-zA-Z0-9]+$', word))):
                word_list_tokens.append(word)
            else:
                word_list_tokens.append("FW")
        word_list = word_list_tokens
        ordered = [word_list[i:i + order] for i in range(1, len(word_list) - order)] # shingle into n-grams
        probs = [prob(x) for x in ordered] # calculate probabilities
        score += sum(probs) # final answer
    return score

# Translator

In [17]:
def translate(sentence):
    '''
    Translate a given sentence to traditional
    Input: Simplified Sentence
    Output: Traditional Sentence
    '''
    candidates = convert(sentence) # get the candidate lists
    final_sent = ""
    for words in candidates:
        if len(words) > 1:
            # one-to-many mapping: one simplified token has several traditional candidates
            score = float("-inf") # start below any possible log-likelihood
            likely = ""
            for candidate in words:
                temp = final_sent
                temp = temp + " "  + candidate # add a candidate to temp sentence
                current_score = backoff(temp, 2) # log-likelihood under the bigram model
                if current_score > score:
                    # best score so far, so keep this candidate
                    score = current_score
                    likely = candidate
            final_sent = final_sent + " " + likely
        else:
            final_sent = final_sent + " " + words[0]
    final_sent = final_sent.replace(" ", "")
    final_sent = add_back_spaces(sentence, final_sent)
    return final_sent
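
The selection loop is a greedy search: each ambiguous token is resolved using only the text chosen so far. A minimal sketch of that idea follows, with a toy scorer standing in for the language model. `greedy_translate`, `toy_score` and `GOOD_PAIRS` are hypothetical names, and the scorer is an assumption made purely for illustration.

```python
# Toy stand-in for the language model: reward known adjacent pairs.
GOOD_PAIRS = {("頭", "髮"), ("出", "發")}

def toy_score(tokens):
    """Count how many adjacent token pairs are known to be good."""
    return sum(1 for pair in zip(tokens, tokens[1:]) if pair in GOOD_PAIRS)

def greedy_translate(candidate_lists, score_fn):
    """Pick, token by token, the candidate that scores best with the prefix."""
    chosen = []
    for options in candidate_lists:
        if len(options) == 1:
            chosen.append(options[0])  # unambiguous token
        else:
            chosen.append(max(options, key=lambda opt: score_fn(chosen + [opt])))
    return "".join(chosen)

print(greedy_translate([["頭"], ["發", "髮"]], toy_score))
print(greedy_translate([["出"], ["發", "髮"]], toy_score))
```

Greedy selection never revisits an earlier choice, so it stays linear in the number of tokens at the cost of possibly missing the globally best sentence.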

In [18]:
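def add_back_spaces(original, current):
    '''
    Re-inserts the spaces of the original sentence into the converted sentence.
    Assumes that both sentences have the same number of non-space characters.
    '''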
    current_list = list(current)
    original_list = list(original)
    count = 1
    for index, char in enumerate(original_list):
        if char == " ":
            current_list[index - count] += " "
            count += 1
    current = "".join(current_list)
    return current
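
The space restoration can also be expressed as a single pass over the original sentence. The sketch below is an alternative formulation, not the notebook's function; `restore_spaces` is a hypothetical name, and it assumes the converted text has exactly one character for every non-space character of the original.

```python
def restore_spaces(original, converted):
    """Walk the original; emit a space where it had one, else the next converted char."""
    chars = iter(converted)
    return "".join(" " if c == " " else next(chars) for c in original)

print(restore_spaces("头发 干了", "頭髮乾了"))
```

Walking the original sentence this way also handles a leading space, which an index-offset approach has to treat as a special case.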

In [19]:
sentence = "姚松炎、周庭势被「DQ」? 泛民质疑，政府再取消参选人资格涉政治筛选，要求律政司司长郑若骅解释法律理据。 有报道指，据全国人大常委会就《基本法》第一百零四条进行的释法，代表泛民参选立法会港岛及九龙西补选的香港众志周庭和被「DQ」前议员姚松炎，势被取消参选资格。律政司表示，法律政策专员黄惠冲将于稍后时间与泛民议员会面，确实时间待定。 民主派议员前晚在律政中心外静坐要求与律政司司长郑若骅会面不果后，昨在立法会召开记者招待会，要求郑就撤销参选人资格的理据，及其给予选举主任的法律意见作出详细交代。公民党议员郭荣铿批评，郑不向公众交代的做法是「冇承担，冇责任」的表现，不能只把责任交托予公务员。 人民力量主席陈志全说，如参选设政治筛选是「假民主」的表现，形容事件「将令香港的民主制度倒退二十年」。民主党主席胡志伟亦担心，事件将令香港步向「一国一制」。 公共专业联盟议员莫乃光提到，泛民将发起「一人一信」行动，向政府表达「反对DQ」的声音。泛民明日将在公民广场举行集会，要求政府立即核实姚、周二人的参选资格。 姚松炎则表示，如政府引用《基本法》第一百零四条的释法取消自己的参选资格，理据并不充分。他认为，释法只是指出自己的宣誓无效，不会被重新安排宣誓，但无说明自己不能在同一届立法会其他界别的议席宣誓，亦未提及自己的参选权会被剥夺。他反问不能重新宣誓是否「一生都不能宣誓」，现有如「剥夺自己政治权利终身」。 行政会议成员汤家骅在电台节目上说，政府如援以《基本法》第一百零四条的释法及去年取消议员法律资格的法律决定，并不能作为撤销姚松炎参选资格的法律理据。他解释，上次的释法及法律决定是按照「成为议员的就职资格」而作出，「就职资格」与「参选资格」不相同，违反就职资格不能被引申为不可参选。 对于周庭所属的香港众志主张民主自决，并将「港独」列为选项，汤家骅承认，倘若自决是《基本法》框架外的主张，她被取消资格的机会的确会「多少少」。但他指出，政府需仔细研究香港众志的党纲，「不能单靠党纲文字，便裁定他们不拥护《基本法》」。 公民党党魁杨岳桥指，释法内容并非剥夺政治权利，倘政府以释法为理据，限制姚松炎出选，做法是「移船就磡」，更反问「是否有人诚心跪玻璃悔改，都不能透过补选再次入闸？」他期望郑若骅尽快交代，政府不要再利用公务员或选举主任作政治决定。 前学联常委司徒子朗则在其facebook贴文，希望资助五万元，寻找有志之士参加九龙西补选，担任「真PLAN B」，因为他认为将替代姚松炎出选的民主党袁海文的支持度不足。 民建联郑泳舜及报称独立的前青年民建联成员、物理治疗师蔡东洲亦报名参选九龙西。港岛区的参选人还有新民党陈家珮及任亮宪。"
a = translate(sentence)
a

'姚松炎、周庭勢被「DQ」? 泛民質疑，政府再取消參選人資格涉政治篩選，要求律政司司長鄭若驊解釋法律理據。 有報道指，據全國人大常委會就《基本法》第一百零四條進行的釋法，代表泛民參選立法會港島及九龍西補選的香港眾志周庭和被「DQ」前議員姚松炎，勢被取消參選資格。律政司表示，法律政策專員黃惠沖將於稍後時間與泛民議員會面，確實時間待定。 民主派議員前晚在律政中心外靜坐要求與律政司司長鄭若驊會面不果後，昨在立法會召開記者招待會，要求鄭就撤銷參選人資格的理據，及其給予選舉主任的法律意見作出詳細交代。公民黨議員郭榮鏗批評，鄭不向公眾交代的做法是「冇承擔，冇責任」的表現，不能只把責任交托予公務員。 人民力量主席陳志全說，如參選設政治篩選是「假民主」的表現，形容事件「將令香港的民主制度倒退二十年」。民主黨主席胡志偉亦擔心，事件將令香港步向「一國一制」。 公共專業聯盟議員莫乃光提到，泛民將發起「一人一信」行動，向政府表達「反對DQ」的聲音。泛民明日將在公民廣場舉行集會，要求政府立即核實姚、週二人的參選資格。 姚松炎則表示，如政府引用《基本法》第一百零四條的釋法取消自己的參選資格，理據並不充分。他認為，釋法只是指出自己的宣誓無效，不會被重新安排宣誓，但無說明自己不能在同一屆立法會其他界別的議席宣誓，亦未提及自己的參選權會被剝奪。他反問不能重新宣誓是否「一生都不能宣誓」，現有如「剝奪自己政治權利終身」。 行政會議成員湯家驊在電臺節目上說，政府如援以《基本法》第一百零四條的釋法及去年取消議員法律資格的法律決定，並不能作為撤銷姚松炎參選資格的法律理據。他解釋，上次的釋法及法律決定是按照「成為議員的就職資格」而作出，「就職資格」與「參選資格」不相同，違反就職資格不能被引申為不可參選。 對於周庭所屬的香港眾志主張民主自決，並將「港獨」列為選項，湯家驊承認，倘若自決是《基本法》框架外的主張，她被取消資格的機會的確會「多少少」。但他指出，政府需仔細研究香港眾志的黨綱，「不能單靠黨綱文字，便裁定他們不擁護《基本法》」。 公民黨黨魁楊岳橋指，釋法內容並非剝奪政治權利，倘政府以釋法為理據，限制姚松炎出選，做法是「移船就磡」，更反問「是否有人誠心跪玻璃悔改，都不能透過補選再次入閘？」他期望鄭若驊儘快交代，政府不要再利用公務員或選舉主任作政治決定。 前學聯常委司徒子朗則在其facebook貼文，希望資助五萬