# Assignment-02, Probability Model, A First Look: An Introduction to Language Models

In [5]:
import pandas as pd
import numpy as np
import jieba
import os
import re
import pickle
import random
from collections import Counter
from functools import reduce
import matplotlib.pyplot as plt
%matplotlib inline

## 1. Review the programming code from the online course.

#### Cleaning 

In [16]:
def token(string):
    # \w already covers digits; inside [...] a '|' would match a literal pipe.
    return ' '.join(re.findall(r'\w+', string))

def _to_cleanstr(all_articles):
    # Clean each article, then concatenate everything into one string.
    return ''.join(token(str(a)) for a in all_articles)

def cut(string): 
    return list(jieba.cut(string))

def text_to_corpus(text):
    ALL_TOKENS = cut(text)
    valid_tokens = [t for t in ALL_TOKENS if t.strip() and t != 'n']
    return valid_tokens

In [26]:
df = pd.read_csv('../datasource-master/sqlResult_1558435.csv', encoding='gb18030')

In [27]:
df.columns

Index(['id', 'author', 'source', 'content', 'feature', 'title', 'url'], dtype='object')

In [28]:
text = _to_cleanstr(df['content'].tolist())
print('length of text: {}'.format(len(text)))

length of text: 37412506


In [30]:
valid_tokens = text_to_corpus(text)
print('Number of Words: {}'.format(len(valid_tokens)))

Number of Words: 17221785


#### One Gram

In [33]:
words_count = Counter(valid_tokens)
frequences_all = [f for w, f in words_count.most_common()]
frequences_sum = sum(frequences_all)

In [17]:
def get_prob(word): 
    esp = 1 / frequences_sum
    if word in words_count: 
        return words_count[word] / frequences_sum
    else:
        return esp

In [18]:
def product(numbers):
    return reduce(lambda n1, n2: n1 * n2, numbers)

In [19]:
def language_model_one_gram(string):
    words = cut(string)
    return product([get_prob(w) for w in words])

In [39]:
need_compared = [
    "今天晚上请你吃大餐，我们一起吃日料 明天晚上请你吃大餐，我们一起吃苹果",
    "真事一只好看的小猫 真是一只好看的小猫",
    "我去吃火锅，今晚 今晚我去吃火锅"
]

for s in need_compared:
    s1, s2 = s.split()
    p1, p2 = language_model_one_gram(s1), language_model_one_gram(s2)
    
    better = s1 if p1 > p2 else s2
    
    print('{} is more likely'.format(better))
    print('-'*4 + ' {} with probability {}'.format(s1, p1))
    print('-'*4 + ' {} with probability {}'.format(s2, p2))

明天晚上请你吃大餐，我们一起吃苹果 is more likely
---- 今天晚上请你吃大餐，我们一起吃日料 with probability 6.279484454158278e-50
---- 明天晚上请你吃大餐，我们一起吃苹果 with probability 5.1533768284792506e-48
真是一只好看的小猫 is more likely
---- 真事一只好看的小猫 with probability 2.873219522813593e-25
---- 真是一只好看的小猫 with probability 1.0935351206452033e-21
今晚我去吃火锅 is more likely
---- 我去吃火锅，今晚 with probability 6.876097222574346e-26
---- 今晚我去吃火锅 with probability 1.1841866800627252e-18
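
The products above are already as small as 1e-50, and for much longer sentences a product of probabilities would underflow to 0.0 in floating point. A common workaround is to sum log-probabilities instead. The following is a minimal sketch under that assumption; `make_unigram_logprob` and `toy_tokens` are hypothetical names that do not appear in the original cells, and in the notebook `valid_tokens` would take the place of the toy corpus.

```python
import math
from collections import Counter

def make_unigram_logprob(tokens):
    """Build a function that scores a word list by summed log-probabilities."""
    counts = Counter(tokens)
    total = sum(counts.values())

    def logprob(words):
        # Unseen words fall back to a count of 1, mirroring get_prob's 1/total floor.
        return sum(math.log(counts.get(w, 1) / total) for w in words)

    return logprob

# Toy corpus (hypothetical), used only to keep the sketch self-contained.
toy_tokens = ['a', 'b', 'a', 'c', 'a', 'b']
score = make_unigram_logprob(toy_tokens)

# A 5000-word input still gets a finite score, whereas 0.5 ** 5000 is 0.0.
long_score = score(['a'] * 5000)
```

Because the logarithm is monotonic, comparing two sentences by log score gives the same ranking as comparing their probability products.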


#### Two-Grams 

In [40]:
valid_tokens = [str(t) for t in valid_tokens] 
# A sequence of n tokens has n - 1 adjacent pairs.
all_2_grams_words = [''.join(valid_tokens[i:i+2]) for i in range(len(valid_tokens) - 1)]

In [13]:
_2_gram_sum = len(all_2_grams_words)
_2_gram_counter = Counter(all_2_grams_words)

In [20]:
def get_combination_prob(w1, w2):
    if w1 + w2 in _2_gram_counter:
        return _2_gram_counter[w1 + w2] / _2_gram_sum
    else:
        return 1 / _2_gram_sum
    
def get_prob_2_gram(w1, w2):
    return get_combination_prob(w1, w2) / get_prob(w1)

def langauge_model_of_2_gram(sentence):
    sentence_probability = 1
    
    words = cut(sentence)
    
    for i, word in enumerate(words):
        if i == 0: 
            prob = get_prob(word)
        else:
            previous = words[i-1]
            prob = get_prob_2_gram(previous, word)
        sentence_probability *= prob
    
    return sentence_probability

In [43]:
need_compared = [
    "今天晚上请你吃大餐，我们一起吃日料 明天晚上请你吃大餐，我们一起吃苹果",
    "真事一只好看的小猫 真是一只好看的小猫",
    "今晚我去吃火锅 今晚火锅去吃我",
    "洋葱奶昔来一杯 养乐多绿来一杯"
]

for s in need_compared:
    s1, s2 = s.split()
    p1, p2 = langauge_model_of_2_gram(s1), langauge_model_of_2_gram(s2)
    
    better = s1 if p1 > p2 else s2
    
    print('{} is more likely'.format(better))
    print('-'*4 + ' {} with probability {}'.format(s1, p1))
    print('-'*4 + ' {} with probability {}'.format(s2, p2))

今天晚上请你吃大餐，我们一起吃日料 is more likely
---- 今天晚上请你吃大餐，我们一起吃日料 with probability 6.895905640955031e-28
---- 明天晚上请你吃大餐，我们一起吃苹果 with probability 5.516724512764024e-28
真是一只好看的小猫 is more likely
---- 真事一只好看的小猫 with probability 1.6570998748154123e-19
---- 真是一只好看的小猫 with probability 3.4765951336188093e-16
今晚我去吃火锅 is more likely
---- 今晚我去吃火锅 with probability 6.82225584071837e-14
---- 今晚火锅去吃我 with probability 9.986004768787415e-16
养乐多绿来一杯 is more likely
---- 洋葱奶昔来一杯 with probability 1.0579577386518395e-12
---- 养乐多绿来一杯 with probability 5.806600374258542e-08
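
The cells above key each bigram by the concatenated string `w1 + w2`, which cannot tell apart two different word pairs that happen to concatenate to the same string, and they divide a floored joint probability by a unigram probability, so the result is not guaranteed to stay below 1. One possible alternative, sketched here rather than taken from the course code, keys bigrams by `(w1, w2)` tuples and estimates the conditional probability from counts with add-one (Laplace) smoothing. `build_bigram_model` and `toy_tokens` are hypothetical names.

```python
from collections import Counter

def build_bigram_model(tokens):
    """Return a function estimating P(w2 | w1) with add-one smoothing."""
    unigram_counts = Counter(tokens)
    # zip pairs each token with its successor, giving len(tokens) - 1 bigrams.
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    vocab_size = len(unigram_counts)

    def cond_prob(w1, w2):
        # Counter returns 0 for missing keys, so unseen pairs get 1 / (C(w1) + V).
        return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + vocab_size)

    return cond_prob

# Toy corpus (hypothetical); in the notebook this would be valid_tokens.
toy_tokens = ['我', '去', '吃', '火锅', '我', '去', '吃', '日料']
cond_prob = build_bigram_model(toy_tokens)

seen = cond_prob('我', '去')    # pair observed twice in the toy corpus
unseen = cond_prob('吃', '我')  # pair never observed
```

Add-one smoothing is only one of several options and tends to give unseen pairs too much mass on a large vocabulary; it is used here because it is the shortest to write down.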


## 2. Review the main points of this lesson. 

##### 1. How do we use GitHub, and why do we use Jupyter and PyCharm?

Ans: GitHub is a code-hosting platform for version control and collaboration. Jupyter Notebook is an interactive application for creating documents that contain live code, visualizations, and markdown. PyCharm is a Python IDE. For engineering-level coding, choose PyCharm for its powerful features such as on-the-fly error checking, quick-fixes, and project management; for live coding and presentations, or for combining code, figures, and text, choose Jupyter; for team collaboration and version control, definitely choose GitHub, even if you might not like it...

##### 2. What is a probability model?

Ans: A probability model assigns a probability to each event through a mathematical representation. In Natural Language Processing, it learns the probability of word occurrences from example text and predicts the next word in a sequence given the words that precede it.

##### 3. Can you come up with some scenarios in which we could use a probability model?

Ans: Statistical language modeling, simulation, credit scoring

##### 4. Why do we use probability, and what are the difficulties of programming based on parsing and pattern matching?

Ans: The real world is always complicated, and it is hard to enumerate all possible patterns or rules. Probability modeling simplifies the real world into a hypothesized sample space and assigns each event a probability under a statistical distribution.

##### 5. What is a language model?

Ans: A statistical language model is a probability distribution over sequences of words. Given such a sequence, it assigns a probability to the whole sequence. 

##### 6. Can you come up with some scenarios in which we could use a language model?

Ans: Speech recognition, machine translation, POS tagging, parsing, natural language inference, Q&A system

##### 7. What is the 1-gram language model?

Ans: In a 1-gram (unigram) model, the probability of a word depends only on the word itself and is independent of the surrounding words.

##### 8. What are the advantages and disadvantages of the 1-gram language model?

Ans: It is easy to understand and calculate, but it oversimplifies the real situation because it ignores word order and context.

##### 9. What is the 2-gram model?

Ans: In a 2-gram (bigram) model, the probability of a word is conditioned on the single word that precedes it.
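
Written out, the 1-gram and 2-gram models are two approximations of the chain rule for a sentence probability. The notation below is introduced here for illustration and is not from the original notes: $w_i$ is the $i$-th word of the sentence and $C(\cdot)$ is a count in the training corpus.

```latex
\begin{align}
P(w_1, \dots, w_n) &= \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1}) && \text{(chain rule, exact)} \\
P(w_1, \dots, w_n) &\approx \prod_{i=1}^{n} P(w_i) && \text{(1-gram)} \\
P(w_1, \dots, w_n) &\approx P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1}) && \text{(2-gram)} \\
P(w_i \mid w_{i-1}) &\approx \frac{C(w_{i-1}\, w_i)}{C(w_{i-1})} && \text{(count-based estimate)}
\end{align}
```

The last line is the usual maximum-likelihood estimate; any smoothing for unseen pairs modifies these counts.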

##### 10. What is a web crawler, and can you implement a simple crawler?

Ans: A web crawler is a tool that navigates websites and extracts useful information according to given requirements. A simple one can be implemented with Python packages such as beautifulsoup, scrapy, and requests.
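
As a rough illustration of the extraction half of a crawler, the sketch below pulls links out of an HTML string. It deliberately uses only the standard library (`html.parser` and `urllib.parse`) instead of the packages named above so that it runs without extra installs; `LinkExtractor`, `extract_links`, and the sample page are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value:
                # Resolve relative links against the URL of the current page.
                self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

# Hypothetical page content; a real crawler would download it first,
# for example with urllib.request.urlopen(url).read().decode('utf-8').
sample_html = '<p><a href="/wiki/A">A</a> <a href="https://example.org/b">B</a></p>'
links = extract_links(sample_html, 'https://example.com/index.html')
```

A full crawler would repeat this step in a loop: fetch a page, extract its links, and queue the ones it has not visited yet.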

##### 11. Some issues may make crawler programming difficult. What are they, and how do we solve them?

Ans: Typical issues include pages with denied access, limited server capacity, duplicate pages caused by poor website architecture, and a bad internet connection.
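
Two of these issues, unreliable connections and duplicate pages, have simple generic mitigations: retry a failed download after a pause, and remember which URLs were already visited. The sketch below shows one way this might look. It is not from the course material; `fetch_with_retry`, `crawl`, and the `flaky` stand-in for a real download function are all hypothetical.

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay=0.0):
    """Call fetch(url), retrying on exceptions with a growing pause."""
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as err:  # a real crawler would catch narrower errors
            last_error = err
            # Waiting longer after each failure also eases load on the server.
            time.sleep(delay * (attempt + 1))
    raise last_error

def crawl(urls, fetch):
    """Fetch each distinct URL once, skipping duplicates and failed pages."""
    seen, pages = set(), {}
    for url in urls:
        if url in seen:
            continue
        seen.add(url)
        try:
            pages[url] = fetch_with_retry(fetch, url)
        except Exception:
            pass  # give up on this page and move on
    return pages

# Stand-in for a real download function: the first call times out,
# and the URL 'bad' is always denied.
calls = {'n': 0}

def flaky(url):
    calls['n'] += 1
    if url == 'bad':
        raise IOError('access denied')
    if calls['n'] == 1:
        raise IOError('timeout')
    return 'page:' + url

pages = crawl(['a', 'a', 'bad', 'b'], flaky)
```

In practice the pause would be a second or more, and a crawler would also respect the site's robots.txt rules.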

##### 12. What is a regular expression, and how do we use it?

Ans: Regular expressions are a system for matching patterns in text data. See https://www.rexegg.com/regex-quickstart.html for a regular expression cheat sheet.
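
A few typical uses with Python's built-in `re` module are sketched below. The sample string and the e-mail pattern are made up for illustration, and the pattern is intentionally simplistic rather than a full validator.

```python
import re

sample = 'Tel: 010-1234 5678, 邮箱 test_01@example.com!'

# \w+ matches runs of letters, digits, and underscores; for str patterns
# in Python 3 this is Unicode-aware, so Chinese characters match as well.
words = re.findall(r'\w+', sample)

# A deliberately simple e-mail pattern, for illustration only.
emails = re.findall(r'[\w.]+@[\w.]+\.\w+', sample)

# re.sub replaces every match; here each run of digits becomes '#'.
masked = re.sub(r'\d+', '#', sample)
```

Writing patterns as raw strings (`r'...'`) keeps backslashes such as `\w` and `\d` from being interpreted as string escapes.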

## 3. Using Wikipedia dataset to finish the language model. 

Step 1: You need to download the corpus from Wikipedia:
> https://dumps.wikimedia.org/zhwiki/20190401/

Step 2: You may need the help of wiki-extractor:

> https://github.com/attardi/wikiextractor

Step 3: Use the techniques and methods above to build the language model;
> 

Step 4: Try some interesting sentence pairs, and check whether your model can distinguish them

> 

Step 5: If we need to solve the following problems, how can a language model help us?

+ Voice recognition.
+ Sogou *pinyin* input.
+ Autocorrection in search engines.
+ Anomaly detection.
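
One way to view all four tasks is as scoring: the upstream system proposes candidate word sequences, and the language model prefers the candidate it finds least surprising, or flags an input as abnormal when every reading is very surprising. The sketch below uses perplexity as that measure. The per-word probabilities and the threshold are invented for illustration; in practice they would come from a trained model and from tuning.

```python
import math

def perplexity(word_probs):
    """Perplexity of a sentence given the model's per-word probabilities."""
    avg_log = sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(-avg_log)

# Hypothetical per-word probabilities that some language model assigned
# to two candidate readings of the same input.
candidates = {
    '今晚我去吃火锅': [0.02, 0.1, 0.05, 0.2, 0.1],
    '今晚火锅去吃我': [0.02, 0.001, 0.002, 0.2, 0.001],
}

# Recognition, pinyin input, and autocorrection: keep the least surprising candidate.
best = min(candidates, key=lambda s: perplexity(candidates[s]))

# Anomaly detection: flag anything above an (arbitrary) perplexity threshold.
THRESHOLD = 100.0
anomalies = [s for s, probs in candidates.items() if perplexity(probs) > THRESHOLD]
```

Perplexity normalizes by sentence length, so candidates with different numbers of words can be compared more fairly than with raw probability products.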

In [58]:
paths = ['text/AA/', 'text/AB/', 'text/AC/']
articles = []
for path in paths:
    for filename in os.listdir(path):
        text = ''
        # Explicit encoding, assuming the extracted wiki files are UTF-8.
        with open(path + filename, 'r', encoding='utf-8') as f:
            for line in f:
                if '<' not in line and 'url' not in line:
                    text = text + ' ' + line.strip()
        articles.append(text)

In [59]:
text = _to_cleanstr(articles)
print('length of text: {}'.format(len(text)))

length of text: 111570737


In [60]:
valid_tokens = text_to_corpus(text)
print('Number of Words: {}'.format(len(valid_tokens)))

Number of Words: 50231401


In [61]:
import pickle

with open('tokens.pkl', 'wb') as f:
    pickle.dump(valid_tokens, f)

## Run from here! 

In [6]:
with open('tokens.pkl', 'rb') as f:
    valid_tokens = pickle.load(f)

In [7]:
len(valid_tokens)

50231401

In [9]:
words_count = Counter(valid_tokens)
frequences_all = [f for w, f in words_count.most_common()]
frequences_sum = sum(frequences_all)

In [10]:
valid_tokens[:10]

['数学', '数学', '是', '利用', '符号语言', '研究', '數量', '结构', '变化', '以及']

In [11]:
valid_tokens = [str(t) for t in valid_tokens] 
# A sequence of n tokens has n - 1 adjacent pairs.
all_2_grams_words = [''.join(valid_tokens[i:i+2]) for i in range(len(valid_tokens) - 1)]

_2_gram_sum = len(all_2_grams_words)
_2_gram_counter = Counter(all_2_grams_words)

In [31]:
need_compared = [
    "人工智能在现实生活中有哪些有趣的应用 人工智能在虚拟生活中有哪些有趣的应用",
    "一名被升职的员工自杀 一名被开除的员工自杀",
    "总冠军诞生的毫无悬念 总冠军输的毫无悬念",
    "只说只想不去做 只坐不说不去想"
]

for s in need_compared:
    s1, s2 = s.split()
    p1, p2 = langauge_model_of_2_gram(s1), langauge_model_of_2_gram(s2)
    
    better = s1 if p1 > p2 else s2
    
    print('{} is more likely'.format(better))
    print('-'*4 + ' {} with probability {}'.format(s1, p1))
    print('-'*4 + ' {} with probability {}'.format(s2, p2))

人工智能在现实生活中有哪些有趣的应用 is more likely
---- 人工智能在现实生活中有哪些有趣的应用 with probability 2.4714269573125757e-23
---- 人工智能在虚拟生活中有哪些有趣的应用 with probability 1.0615538863363064e-27
一名被开除的员工自杀 is more likely
---- 一名被升职的员工自杀 with probability 7.103628279238841e-21
---- 一名被开除的员工自杀 with probability 7.971056255045391e-19
总冠军诞生的毫无悬念 is more likely
---- 总冠军诞生的毫无悬念 with probability 4.595968066056431e-17
---- 总冠军输的毫无悬念 with probability 2.558836274615202e-17
只说只想不去做 is more likely
---- 只说只想不去做 with probability 3.8647164622801056e-20
---- 只坐不说不去想 with probability 1.4301131381521027e-22


### Compared to the previously learned parsing and pattern matching approaches, what are the advantages and disadvantages of probability-based methods?

Ans: The advantage is that probability modeling simplifies the assumptions about sentence structure and decomposes the probability of a sentence into the probabilities of its words. The disadvantage is that it oversimplifies the dependencies between the words that make up a sentence. In the cases above, even if we shuffle the word order, the resulting probabilities do not change much.

## (Optional)  How to solve *OOV* problem?

Some words may not appear in our dictionary or corpus. When using a language model, we need to overcome this `out-of-vocabulary` (OOV) problem. Many smart people have worked on solving it.

-- 

The first question is: 

**Q1: How did you solve this problem in your programming task?**

Ans: Assign unseen n-grams a small probability of 1/n_gram_size, the reciprocal of the total n-gram count.

Then, the second question is:

**Q2: Read about the 'Good-Turing estimator'. Can you explain the main points of this method, and implement it in your programming task?**

Reference: 
+ https://www.wikiwand.com/en/Good%E2%80%93Turing_frequency_estimation
+ https://github.com/Computing-Intelligence/References/blob/master/NLP/Natural-Language-Processing.pdf, Page-37

> coding in here
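
The main idea of Good-Turing estimation is to use the number of items seen exactly once to estimate how much probability belongs to items never seen at all. With $N_r$ the number of distinct items observed exactly $r$ times and $N$ the total number of observations, each observed count $r$ is replaced by the adjusted count $r^* = (r + 1) N_{r+1} / N_r$, and the total probability mass reserved for unseen items is $N_1 / N$. The sketch below applies only this basic formula. Keeping the raw count when $N_{r+1} = 0$ is a simplifying assumption made here; fuller variants smooth the $N_r$ values first. `good_turing` and `toy_counts` are hypothetical names.

```python
from collections import Counter

def good_turing(counts):
    """Return (adjusted_counts, unseen_mass) using the basic Good-Turing formula."""
    total = sum(counts.values())
    freq_of_freq = Counter(counts.values())  # N_r: how many items occur r times
    adjusted = {}
    for item, r in counts.items():
        n_r, n_r1 = freq_of_freq[r], freq_of_freq[r + 1]
        # Assumption: fall back to the raw count when N_{r+1} is 0, because the
        # basic formula would otherwise zero out the most frequent items.
        adjusted[item] = (r + 1) * n_r1 / n_r if n_r1 else float(r)
    unseen_mass = freq_of_freq[1] / total  # N_1 / N
    return adjusted, unseen_mass

# Toy counts (hypothetical): a occurs 3 times, b and c twice, d, e, f once.
toy_counts = Counter(['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'e', 'f'])
adjusted, unseen_mass = good_turing(toy_counts)
```

In the notebook, `words_count` or `_2_gram_counter` could be passed in place of `toy_counts`, and the unseen mass would replace the fixed 1/n_gram_size floor from Q1.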