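First, for every grammar of a term, compute a weighted score (the number of times the grammar occurs * the grammar's length) and filter the grammars using one standard deviation as the cutoff.
Then pick a good pattern from each of the term's remaining grammars.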

In [1]:
import akl
import math
import operator
from pprint import pprint
from collections import defaultdict

In [2]:
AKL = list(akl.akl.keys())

In [3]:
PRONS = set([line.strip('\n') for line in open('prons.txt')])

In [4]:
# First line of the file, tab-separated; strip the newline so the last word stays clean
HIFREWORDS = open('HiFreWords').readline().strip('\n').split('\t')

- create_sentence_pattern_list: converts the lines of o.txt into a two-level (nested) list, one inner list per blank-line-separated block

In [5]:
def create_sentence_pattern_list(input_pat):
    pattern = []
    final = []
    for i in input_pat:
        if i != '':
            pattern.append(i)
        else:
            final.append(pattern.copy())
            pattern.clear()

    # Last one
    final.append(pattern)
    return final

In [6]:
test = 'all that remains for me to do is to say good-bye .\nwww'.strip('\n').split('\n')
create_sentence_pattern_list(test)

[['all that remains for me to do is to say good-bye .', 'www']]
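
The test string above contains no blank line, so everything lands in a single group. Below is a minimal sketch of the multi-group case, assuming (as the function implies) that blank lines separate the blocks in o.txt. `group_on_blank_lines` is a hypothetical stand-in written with `itertools.groupby`, and the sample lines are made up for illustration. Unlike `create_sentence_pattern_list`, it never emits an empty group for consecutive or trailing blank lines.

```python
from itertools import groupby

def group_on_blank_lines(lines):
    """Group consecutive non-empty lines; empty strings act as separators."""
    return [
        list(group)
        for is_blank, group in groupby(lines, key=lambda s: s == '')
        if not is_blank
    ]

# Made-up sample: a sentence followed by its "TERM\tgrammar\tpattern" lines
sample = [
    'sentence one .',
    'TERM\tN to v\tpattern a',
    '',
    'sentence two .',
    'TERM\tV n\tpattern b',
]
groups = group_on_blank_lines(sample)
print(len(groups))  # 2
```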

In [7]:
# Corpus
corpus = open('o.txt', 'r').read().strip('\n').split('\n')
corpus = create_sentence_pattern_list(corpus)

- term : ABILITY 
- grammar : N to v
- pattern : its bulk and ability to fly

In [8]:
def build_pattern_dict(corpus):
    pattern_dict = defaultdict(lambda: defaultdict(list))
    for _object in corpus:
        sent = _object[0]
        for c in _object[1:]:
            term, grammar, pattern = c.split('\t')
            pattern_dict[term][grammar] += [pattern]       
    return pattern_dict

In [9]:
pattern_dict = build_pattern_dict(corpus)
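
To make the resulting structure concrete, here is a small sketch that applies the same term → grammar → list-of-patterns grouping to a toy corpus. The three blocks in `toy_corpus` are invented for illustration and are not taken from o.txt. The shape (sentence first, then tab-separated `term`, `grammar`, `pattern` lines) is assumed from `build_pattern_dict`.

```python
from collections import defaultdict

# Hypothetical corpus: each block is [sentence, "TERM\tgrammar\tpattern", ...]
toy_corpus = [
    ['it has the ability to fly .', 'ABILITY\tN to v\tability to fly'],
    ['it is useful to have a map .', 'USEFUL\tADJ to v\tuseful to have'],
    ['maps are useful for hikers .', 'USEFUL\tADJ for n\tuseful for hikers'],
]

toy_dict = defaultdict(lambda: defaultdict(list))
for sentence, *entries in toy_corpus:
    for entry in entries:
        term, grammar, pattern = entry.split('\t')
        toy_dict[term][grammar].append(pattern)

print(dict(toy_dict['USEFUL']))
# {'ADJ to v': ['useful to have'], 'ADJ for n': ['useful for hikers']}
```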

In [37]:
pattern_dict['USEFUL']['ADJ to v']

['useful to have',
 'useful to keep',
 'useful to have',
 'useful to have',
 'useful to bring',
 'useful to bring along',
 'more useful to hand',
 'more useful to hand out',
 'useful to help',
 'useful to be',
 'useful to exchange',
 'useful to assign',
 'useful to know',
 'useful to evaluate',
 'useful to restrain',
 'useful to tune',
 'useful to tune up',
 'useful to once again make',
 'useful to ask',
 'useful to work',
 'useful to borrow',
 'useful to look',
 'useful to get',
 'useful to indicate',
 'useful to have',
 'useful to portray',
 'useful to remember',
 'useful to amass',
 'useful to understand',
 'useful to understand']

- Number of times a [term][grammar] pair such as ['ABILITY']['N to v'] occurs in the corpus (the cell below lists the counts for USEFUL)

In [22]:
for grammar, sentences in pattern_dict['USEFUL'].items():
    print(grammar,len(sentences))

ADJ in n 18
ADJ to v 30
ADJ as n 8
ADJ for n 20
ADJ on n 3
ADJ at n 1
ADJ to n 14
ADJ after n 1
ADJ in n with n 1
ADJ for n to v 1
V to v 1


In [32]:
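def computeScore(word, sent):
    # Score a candidate example for `word`; a higher score is better.
    # PRONS and HIFREWORDS are read-only module-level globals.
    word = word.lower()
    sent = sent.lower().split()

    # Position of the term in the sentence (-1 if it does not occur as a token)
    locationOfWord = -1 if word not in sent else sent.index(word)
    # Penalty: number of tokens that are not high-frequency words
    hiFreWordsScore = len([w for w in sent if w not in HIFREWORDS])
    # Penalty: number of pronouns
    pronsScore = len([w for w in sent if w in PRONS])

    return locationOfWord - hiFreWordsScore - pronsScore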
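
A small worked example may help show how the three components combine. This is a sketch only: `prons` and `hi_freq` are hypothetical stand-ins for the contents of prons.txt and HiFreWords, and `score_example` mirrors the arithmetic of `computeScore` on those stand-ins.

```python
# Hypothetical stand-ins for the word lists loaded from prons.txt and HiFreWords
prons = {'it', 'them', 'this'}
hi_freq = {'useful', 'to', 'have', 'it', 'is', 'very'}

def score_example(word, sent):
    tokens = sent.lower().split()
    word = word.lower()
    position = tokens.index(word) if word in tokens else -1
    rare = sum(1 for w in tokens if w not in hi_freq)   # non-high-frequency tokens
    pronouns = sum(1 for w in tokens if w in prons)     # pronoun tokens
    return position - rare - pronouns

print(score_example('USEFUL', 'very useful to have'))  # 1 - 0 - 0 = 1
print(score_example('USEFUL', 'useful to amass'))      # 0 - 1 - 0 = -1
```

With this scoring, a candidate loses one point for each pronoun and each word outside the high-frequency list, and gains points the later the term appears in it.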

![Sample standard deviation (source: https://zh.wikipedia.org/wiki/%E6%A8%99%E6%BA%96%E5%B7%AE)](樣本標準差.png)

#### Standard deviation of the (sentence count * grammar length) values
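
Written out, as a reading of the code in this section rather than a formula stated by the author: for a term with $n$ grammars, where grammar $g$ has $c_g$ example patterns and consists of $\ell_g$ space-separated tokens,

$$
x_g = c_g \, \ell_g, \qquad
\bar{x} = \frac{1}{n} \sum_{g} x_g, \qquad
s = \sqrt{\frac{1}{n-1} \sum_{g} \left( x_g - \bar{x} \right)^2}
$$

A grammar is kept when $x_g - s > k_0$, with $k_0 = 1$ in the code. The symbols $c_g$, $\ell_g$, $x_g$ and $s$ are notation introduced here for illustration.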

In [71]:
def get_best_pattern_1(word):
    gramma_avg = 0.0
    stddev = 0.0
    k0 = 1
    
    word = word.upper()
    
    print(word)

    # Total grammar count for the input word
    gramma_count = len(pattern_dict[word].keys())
    
    if gramma_count == 0:
        print('NO RESULT\n')
        return
    
    gra_score_sum = 0.0

    # Mean of the weighted scores (sentence count * grammar length) over all grammars
    for gra, sen_list in pattern_dict[word].items():
        sen_count = len(sen_list)
        gra_word_len = len(gra.split(' '))
        gra_score = sen_count * gra_word_len
        gra_score_sum += gra_score
    gramma_avg =  gra_score_sum/gramma_count
    
    # The sample standard deviation divides by n - 1, so at least two grammars are needed
    if ( gramma_count - 1 ) == 0: 
        print('NO RESULT\n')
        return

    # Calculate stddev
    for gramma, sentences in pattern_dict[word].items():
        sen_count = len(sentences)
        gra_word_len = len(gramma.split(' '))
        gra_score = sen_count * gra_word_len
        stddev += (gra_score - gramma_avg) ** 2
        
    stddev = math.sqrt(stddev / (gramma_count - 1))
        
#     if stddev == 0:
#         print('NO RESULT\n')
#         return

    
    # Filter good grammar
    for gramma, sentences in pattern_dict[word].items():
    
        sen_count = len(sentences)
        gra_word_len = len(gramma.split(' '))
        gra_score = sen_count * gra_word_len
        strength = gra_score - stddev * 1
        if not strength > k0:
            continue
            
        best_score = -999.9
        best_sentence = ''

        # Find Good Dictionary Example
        for sentence in sentences:
            score = computeScore(word, sentence)
            if score >= best_score:
                best_score = score
                best_sentence = sentence

        print('%s (%d) %s' % (gramma, sen_count, best_sentence))
    print()


#### Standard deviation over the sentence counts, with grammar length used as a weight

In [84]:
def get_best_pattern_2(word):
    gramma_avg = 0.0
    stddev = 0.0
    k0 = 1
    
    word = word.upper()
    
    print(word)

    # Total grammar count for the input word
    gramma_count = len(pattern_dict[word].keys())
    
    if gramma_count == 0:
        print('NO RESULT\n')
        return
    
    gra_score_sum = 0.0

    # Mean sentence count over all grammars (no length weighting here)
    for gra, sen_list in pattern_dict[word].items():
        gra_score = len(sen_list)
        gra_score_sum += gra_score
    gramma_avg =  gra_score_sum/gramma_count
    
    # The sample standard deviation divides by n - 1, so at least two grammars are needed
    if ( gramma_count - 1 ) == 0: 
        print('NO RESULT\n')
        return

    # Calculate stddev
    for gramma, sentences in pattern_dict[word].items():
        gra_score = len(sentences)
        stddev += (gra_score - gramma_avg) ** 2
        
    stddev = math.sqrt(stddev / (gramma_count - 1))
        
#     if stddev == 0:
#         print('NO RESULT\n')
#         return

    
    # Filter good grammar
    for gramma, sentences in pattern_dict[word].items():
        sen_count = len(sentences)
        gra_word_len = len(gramma.split(' '))
        gra_score = sen_count * gra_word_len
        strength = gra_score - stddev * 1
        if not strength > k0:
            continue
            
        best_score = -999.9
        best_sentence = ''

        # Find Good Dictionary Example
        for sentence in sentences:
            score = computeScore(word, sentence)
            if score >= best_score:
                best_score = score
                best_sentence = sentence

        print('%s (%d) %s' % (gramma, sen_count, best_sentence))
    print()
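
The two variants differ only in what the standard deviation is taken over. Here is a compact sketch of just the filtering step, using the USEFUL counts listed earlier and `statistics.stdev` (which computes the sample standard deviation with an n - 1 denominator). The helper names `weighted` and `keep` are hypothetical, and the example-selection step is left out.

```python
from statistics import stdev

# Sentence counts per grammar for USEFUL, copied from the listing earlier in the notebook
counts = {
    'ADJ in n': 18, 'ADJ to v': 30, 'ADJ as n': 8, 'ADJ for n': 20,
    'ADJ on n': 3, 'ADJ at n': 1, 'ADJ to n': 14, 'ADJ after n': 1,
    'ADJ in n with n': 1, 'ADJ for n to v': 1, 'V to v': 1,
}
K0 = 1  # same threshold as k0 in the functions above

def weighted(grammar):
    """Sentence count times the number of tokens in the grammar."""
    return counts[grammar] * len(grammar.split(' '))

def keep(spread):
    """Grammars whose weighted score exceeds the spread by more than K0."""
    return [g for g in counts if weighted(g) - spread > K0]

spread_1 = stdev([weighted(g) for g in counts])  # variant 1: stddev of count * length
spread_2 = stdev(list(counts.values()))          # variant 2: stddev of raw counts

print(keep(spread_1))
print(keep(spread_2))
```

Because the raw counts spread less than the weighted scores, the second cutoff is lower, which is why `ADJ as n` passes only in the second variant.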


In [85]:
get_best_pattern_1('useful')

USEFUL
ADJ in n (18) very useful in business
ADJ to v (30) useful to have
ADJ for n (20) especially useful for batch operations
ADJ to n (14) useful to management



In [86]:
get_best_pattern_2('useful')

USEFUL
ADJ in n (18) very useful in business
ADJ to v (30) useful to have
ADJ as n (8) useful as a debugging aid
ADJ for n (20) especially useful for batch operations
ADJ to n (14) useful to management



In [78]:
get_best_pattern_1('ability')

ABILITY
N to v (468) its bulk and ability to fly



In [79]:
get_best_pattern_2('ability')

ABILITY
N to v (468) its bulk and ability to fly



In [80]:
get_best_pattern_1('classify')

CLASSIFY
V into n (8) are classified into groups
V as n (12) are classified as action
V n (20) can manually classify these content items



In [81]:
get_best_pattern_2('classify')

CLASSIFY
V by n (3) classified by BS 2916
V into n (8) are classified into groups
V as n (12) are classified as action
V n (20) can manually classify these content items



In [87]:
get_best_pattern_1('discuss')

DISCUSS
V in n (57) will discuss in detail
V n (270) concerned may have and discuss them



In [88]:
get_best_pattern_2('discuss')

DISCUSS
V in n (57) will discuss in detail
V n (270) concerned may have and discuss them
V adv (31) will discuss later
V wh to v (15) discuss how to eliminate



### Judging from discuss, get_best_pattern_2, which multiplies by the grammar length only at filtering time, does pick up longer grammars such as V wh to v