Understanding how N-gram models compute probabilities from text

Why do we need the Markov assumption to simplify the language-model computation?

The full language model uses the chain rule of probability to compute the probability of a word sequence:


$W = (W_1, W_2, ..., W_m)$

$P(W_1,W_2,...,W_m) = P(W_1) * P(W_2|W_1) * ... * P(W_m|W_1,...,W_{m-1})$

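Why introduce the Markov assumption, which keeps only the previous n-1 words of history and simplifies each conditional probability to: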

$P(W_m|W_1,...,W_{m-1}) = P(W_m|W_{m-n+1},W_{m-n+2},...,W_{m-1})$
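As a concrete case, taking n = 2 gives the bigram model used in the second half of this notebook. Each word is then conditioned on its immediate predecessor only. The index $i$ below is notation introduced here for illustration:

$P(W_m|W_1,...,W_{m-1}) = P(W_m|W_{m-1})$

$P(W_1,W_2,...,W_m) = P(W_1) * \prod_{i=2}^{m} P(W_i|W_{i-1})$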

In [1]:
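print('When the conditioning history is too long, that exact word sequence may never occur in the corpus, which forces P(W_m|W_1,W_2,...,W_{m-1}) = 0')

When the conditioning history is too long, that exact word sequence may never occur in the corpus, which forces P(W_m|W_1,W_2,...,W_{m-1}) = 0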
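The sparsity problem can be illustrated with a minimal sketch. The corpus and the helper names `ngram_counts` and `mle_prob` are hypothetical, made up for this example, and an unseen history is simply treated as probability 0 to mirror the statement above. With the full history the estimate collapses to zero, while the bigram history still yields a usable value.

```python
from collections import Counter

# Hypothetical toy corpus, invented purely for illustration.
corpus = [
    "i want to eat chinese food",
    "i want to eat lunch",
    "they want chinese food",
]

def ngram_counts(sentences, n):
    """Count every contiguous n-word window, sentence by sentence."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts

def mle_prob(sentences, history, word):
    """Estimate P(word | history) as count(history + word) / count(history + anything)."""
    history = tuple(history)
    counts = ngram_counts(sentences, len(history) + 1)
    denominator = sum(c for gram, c in counts.items() if gram[:-1] == history)
    if denominator == 0:
        return 0.0  # assumption: a history that never occurs is scored as 0
    return counts[history + (word,)] / denominator

# Full history: "they want to eat chinese" never occurs in the corpus.
p_full = mle_prob(corpus, "they want to eat chinese".split(), "food")
# Bigram history: "chinese" occurs twice and is followed by "food" both times.
p_bigram = mle_prob(corpus, ["chinese"], "food")
print(p_full, p_bigram)
```

Shortening the history trades some context for counts that are actually observed, which is the practical motivation for the Markov assumption.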


# Judging whether a sentence is plausible with a bigram model

The following probability values are given:

1. p(i|start) = 0.25
2. p(english|want) = 0.0011
3. p(food|english) = 0.5
4. p(end|food) = 0.68
5. p(want|start) = 0.25
6. p(english|i) = 0.0011

In [2]:
import numpy as np
import pandas as pd

# Vocabulary and the number of times each word occurs in the corpus (unigram counts)
words = ['i', 'want', 'to', 'eat', 'chinese', 'food', 'lunch', 'spend']
word_cnts = np.array([2533, 927, 2417, 746, 158, 1093, 341, 278]).reshape(1, -1)
df_word_cnts = pd.DataFrame(word_cnts, columns=words)
df_word_cnts

Unnamed: 0,i,want,to,eat,chinese,food,lunch,spend
0,2533,927,2417,746,158,1093,341,278


In [3]:
# Bigram counts: row = previous word, column = current word
bigram_word_cnts = [[5, 827, 0, 9, 0, 0, 0, 2], 
                    [2, 0, 608, 1, 6, 6, 5, 1], 
                    [2, 0, 4, 686, 2, 0, 6, 211],
                    [0, 0, 2, 0, 16, 2, 42, 0],
                    [1, 0, 0, 0, 0, 82, 1, 0],
                    [15, 0, 15, 0, 1, 4, 0, 0],
                    [2, 0, 0, 0, 0, 1, 0, 0],
                    [1, 0, 1, 0, 0, 0, 0, 0]]
df_bigram_word_cnts = pd.DataFrame(bigram_word_cnts, columns= words, index= words)
df_bigram_word_cnts

Unnamed: 0,i,want,to,eat,chinese,food,lunch,spend
i,5,827,0,9,0,0,0,2
want,2,0,608,1,6,6,5,1
to,2,0,4,686,2,0,6,211
eat,0,0,2,0,16,2,42,0
chinese,1,0,0,0,0,82,1,0
food,15,0,15,0,1,4,0,0
lunch,2,0,0,0,0,1,0,0
spend,1,0,1,0,0,0,0,0


In [4]:
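# Convert counts to conditional probabilities using the unigram totals (df_word_cnts)
# and the bigram counts (df_bigram_word_cnts), e.g. p(want|i) = df_bigram_prob.loc['i', 'want']
df_bigram_prob = df_bigram_word_cnts.copy()

# df_word_cnts.values.T has shape (8, 1), so each row is divided by the count of its own (previous) word
df_bigram_prob = df_bigram_prob / df_word_cnts.values.T
df_bigram_prob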

Unnamed: 0,i,want,to,eat,chinese,food,lunch,spend
i,0.001974,0.32649,0.0,0.003553,0.0,0.0,0.0,0.00079
want,0.002157,0.0,0.655879,0.001079,0.006472,0.006472,0.005394,0.001079
to,0.000827,0.0,0.001655,0.283823,0.000827,0.0,0.002482,0.087298
eat,0.0,0.0,0.002681,0.0,0.021448,0.002681,0.0563,0.0
chinese,0.006329,0.0,0.0,0.0,0.0,0.518987,0.006329,0.0
food,0.013724,0.0,0.013724,0.0,0.000915,0.00366,0.0,0.0
lunch,0.005865,0.0,0.0,0.0,0.0,0.002933,0.0,0.0
spend,0.003597,0.0,0.003597,0.0,0.0,0.0,0.0,0.0
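Each entry in this table is a relative-frequency (maximum-likelihood) estimate: the bigram count divided by the count of the preceding word. The count notation $C(\cdot)$ is introduced here for illustration. Two cells can be checked by hand against the count tables shown earlier:

$P(w_i|w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})}$

$P(want|i) = \frac{827}{2533} \approx 0.32649$

$P(i|want) = \frac{2}{927} \approx 0.002157$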


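Using the given probabilities together with the computed bigram probabilities (df_bigram_prob), decide which of the following two sentences is more plausible.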

s1 = 'i want english food'

s2 = 'want i english food'

In [5]:
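# P(s1) = p(i|start) * p(want|i) * p(english|want) * p(food|english) * p(end|food)
p_s1 = 0.25 * df_bigram_prob.loc['i', 'want'] * 0.0011 * 0.5 * 0.68
# P(s2) = p(want|start) * p(i|want) * p(english|i) * p(food|english) * p(end|food)
p_s2 = 0.25 * df_bigram_prob.loc['want', 'i'] * 0.0011 * 0.5 * 0.68
print('P(S1) = {:.8f}, P(S2) = {:.8f}'.format(p_s1, p_s2))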

P(S1) = 0.00003053, P(S2) = 0.00000020


Since P(S1) > P(S2), s1 is the more plausible sentence.
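The same comparison can be written as a reusable scoring function. The sketch below is one possible way to do it, not the notebook's own method. The dictionary `bigram_prob` and the function `sentence_log_prob` are hypothetical names. The probabilities are the given values plus the two rounded table entries for p(want|i) and p(i|want). Log-probabilities are summed rather than multiplying raw probabilities, a common choice that avoids underflow on longer sentences.

```python
import math

# Keys are (previous word, current word); 'start' and 'end' mark sentence boundaries.
# Values are the given probabilities plus two rounded entries from df_bigram_prob.
bigram_prob = {
    ('start', 'i'): 0.25,
    ('start', 'want'): 0.25,
    ('i', 'want'): 0.32649,
    ('want', 'i'): 0.002157,
    ('want', 'english'): 0.0011,
    ('i', 'english'): 0.0011,
    ('english', 'food'): 0.5,
    ('food', 'end'): 0.68,
}

def sentence_log_prob(sentence, probs):
    """Sum log P(current | previous) over the sentence, padded with start/end."""
    tokens = ['start'] + sentence.split() + ['end']
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = probs.get((prev, cur), 0.0)
        if p == 0.0:
            return float('-inf')  # a missing bigram rules the sentence out entirely
        total += math.log(p)
    return total

log_p_s1 = sentence_log_prob('i want english food', bigram_prob)
log_p_s2 = sentence_log_prob('want i english food', bigram_prob)
print(log_p_s1 > log_p_s2)
```

Because the logarithm is monotonic, comparing log-probabilities ranks the sentences in the same order as comparing P(S1) and P(S2) directly.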