<a href="https://colab.research.google.com/github/raynardj/python4ml/blob/master/experiments/why_language_is_knowledge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Why language is knowledge

> In Chinese, the word 读书 literally means "reading books", but it is also how you say "to study", that is, to acquire new knowledge. It is hard to describe the language modeling task more precisely than that.

## Warm up Quiz

To understand what "language modeling" is, let's start with a warm-up quiz: fill in the blanks.
### Level 1
```
耄__
秦__皇
龟____跑
禁止在室内吸__
The rain in Spain stays mainly in the ____
```
### Level 2
```
新__
__尘_____
食用海鲜导致二次复发的疫情再次点燃人们对新__的恐惧
食用海鲜导致二次复发的疫情的新__再次点燃人们的恐惧
```

### Level 3
```
八星八箭指的是____的切工

_____唯一用物理计量单位作为品牌的奢侈品手表

文艺复兴的发祥地是意大利的_______

泾渭分明中的渭河是__河的支流

环法自行车赛的终点是巴黎的_____大街

大动干戈的干， 指的是______
```

## Play with Transformers

Before we worry ourselves with the complexity, we can play with state-of-the-art language models.

It's a cruel thing to have to program an entire Tetris before playing a single game of it. At least after this session, you'll understand why the other kids are talking about a shortage of long vertical bars.



In [1]:
!pip install -q transformers


In [2]:
import numpy as np
from transformers import AutoModelForMaskedLM,AutoTokenizer,Pipeline,AutoModelForCausalLM

In [3]:
def from_pretrained(tag):
    tokenizer = AutoTokenizer.from_pretrained(tag,use_fast = True)
    model = AutoModelForMaskedLM.from_pretrained(tag)
    return model,tokenizer

## Download tokenizer and model

In [4]:
model,tokenizer = from_pretrained("bert-base-chinese")

Some weights of the model checkpoint at bert-base-chinese were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForMaskedLM were not initialized from the model checkpoint at bert-base-chinese and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Load your function for the task

This function solves the following cloze-style problem: predict the token at position $i$ from all the other tokens in the sentence, on both its left and its right.

$\hat{x_{i}} = f(\{x_{1},x_{2},...,x_{i-1},x_{i+1},...,x_{n}\}) $

Inside, the model is mostly a long chain of matrix multiplications, which takes much more time to explain. Remembering the function above is enough to start using a BERT-like architecture.
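
To make the function above concrete before touching BERT, here is a minimal sketch of a cloze predictor that only counts which token appeared between a given left and right neighbour. The corpus, `context_counts` and `toy_fill_mask` are hypothetical names made up for illustration; BERT looks at the whole sentence rather than two neighbours, so this is an analogy, not its mechanism.

```python
from collections import Counter, defaultdict

# Tiny hypothetical corpus; a real model is trained on vastly more text.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat sat on the rug",
    "the bird flew over the mat",
]

# Count which token appears between each (left neighbour, right neighbour) pair.
context_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = ["[CLS]"] + sentence.split() + ["[SEP]"]
    for i in range(1, len(tokens) - 1):
        context_counts[(tokens[i - 1], tokens[i + 1])][tokens[i]] += 1

def toy_fill_mask(text):
    """Fill each [MASK] using its immediate left and right neighbours."""
    tokens = ["[CLS]"] + text.split() + ["[SEP]"]
    filled = []
    for i, token in enumerate(tokens):
        if token == "[MASK]":
            candidates = context_counts.get((tokens[i - 1], tokens[i + 1]))
            # Fall back to [UNK] when this context was never seen.
            token = candidates.most_common(1)[0][0] if candidates else "[UNK]"
        filled.append(token)
    return " ".join(filled[1:-1])  # drop [CLS] and [SEP]

print(toy_fill_mask("the cat [MASK] on the mat"))
print(toy_fill_mask("the [MASK] sat on the rug"))
```

Note that the second call needs the token to the right of the blank, which is exactly what separates this cloze setup from next-word prediction.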

In [6]:
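def predict_pipeline(pipeline):
    """
    Create a function that predicts the token(s) hidden behind [MASK]
    """
    def masked_language_modeling(text):
        # token ids of the input, including [CLS] and [SEP]
        x = np.array(pipeline.tokenizer(text)['input_ids'])
        # logits of shape (1, sequence_length, vocab_size)
        y_ = pipeline.predict(text)
        # most likely token id per position, kept only where the input was [MASK]
        pred_idx = y_[0].argmax(-1)[x==pipeline.tokenizer.mask_token_id]
        return pipeline.tokenizer.decode(pred_idx)
    return masked_language_modeling

# Instantiating the base Pipeline class directly worked with the transformers
# release this notebook was run on; newer releases may need a concrete pipeline class
pipeline = Pipeline(model,tokenizer,task="MaskedLM")
masked_language_modeling_zh = predict_pipeline(pipeline)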

In [13]:
masked_language_modeling_zh("泾渭分明中的渭河是[MASK]河的支流")

'黄'

In [14]:
masked_language_modeling_zh("食用海鲜导致二次复发的疫情的新[MASK]再次点燃人们的恐惧")

'闻'

In [20]:
masked_language_modeling_zh("文艺复兴的发祥地是意大利的[MASK][MASK][MASK]萨")

'萨 罗 伦'

In [22]:
masked_language_modeling_zh("四个星期一共有四[MASK]二十八个工作日")

'百'

In [21]:
masked_language_modeling_zh("经[MASK][MASK][MASK]批准的法定节假日")

'国 务 院'

### How to encode input (to index numbers) and decode output (from index numbers)

In [53]:
text = "泾渭分明中的渭河是[MASK]河的支流"

tokens = tokenizer.tokenize(text)
tokens

['泾', '渭', '分', '明', '中', '的', '渭', '河', '是', '[MASK]', '河', '的', '支', '流']

In [121]:
tokenizer.special_tokens_map

{'cls_token': '[CLS]',
 'mask_token': '[MASK]',
 'pad_token': '[PAD]',
 'sep_token': '[SEP]',
 'unk_token': '[UNK]'}

In [54]:
x = tokenizer(text)['input_ids']
print(x)

[101, 3814, 3948, 1146, 3209, 704, 4638, 3948, 3777, 3221, 103, 3777, 4638, 3118, 3837, 102]


In [11]:
print(tokenizer.convert_ids_to_tokens(x))

['[CLS]', '泾', '渭', '分', '明', '中', '的', '渭', '河', '是', '[MASK]', '河', '的', '支', '流', '[SEP]']


In [None]:
y_ = pipeline.predict(text)

In [69]:
y_

array([[[ -7.7697077,  -7.8292522,  -7.6659904, ...,  -6.7457013,
          -6.6668267,  -6.741834 ],
        [ -7.068693 ,  -6.9906745,  -7.1348014, ...,  -4.585348 ,
          -1.6766186,  -2.0864406],
        [-10.905392 , -11.537951 , -11.320431 , ...,  -6.6922593,
          -3.9067974,  -4.29532  ],
        ...,
        [-13.253618 , -12.945975 , -13.031917 , ...,  -4.8998117,
         -11.2029   ,  -7.9473567],
        [-13.972485 , -14.342685 , -15.082423 , ...,  -5.287148 ,
          -6.4903975,  -6.277387 ],
        [ -8.830073 ,  -9.006552 ,  -9.276823 , ...,  -6.124199 ,
          -4.6767206,  -5.468654 ]]], dtype=float32)

In [70]:
y_.shape

(1, 16, 21128)

In [71]:
pred_idx = y_[0].argmax(-1)
pred_idx

array([8024, 3814, 3948, 1146, 3209,  704, 4638, 3948, 3777, 3221, 7942,
       3777, 4638, 3118, 3837, 3777])

In [76]:
tokenizer.mask_token,tokenizer.mask_token_id

('[MASK]', 103)

In [72]:
np.array(x)==103

array([False, False, False, False, False, False, False, False, False,
       False,  True, False, False, False, False, False])

In [73]:
mask_pred = pred_idx[np.array(x)==103]
mask_pred

array([7942])

In [74]:
tokenizer.decode(mask_pred)

'黄'
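
The walk-through above keeps only the single best token (`argmax`). As a hedged sketch of how one might look at several candidates instead, the logits at the mask position can be turned into probabilities with a softmax and sorted. The 5-token vocabulary and the logits below are made up for illustration; the real array has shape `(1, 16, 21128)`, and `top_k_at_mask` is a hypothetical helper, not part of `transformers`.

```python
import numpy as np

# Hypothetical toy vocabulary and logits for a 4-token input.
vocab = ["[CLS]", "[MASK]", "黄", "长", "[SEP]"]
mask_token_id = 1
input_ids = np.array([0, 1, 3, 4])
logits = np.array([[
    [2.0, 0.0, 0.1, 0.1, 0.5],
    [0.0, 0.0, 3.0, 1.0, 0.0],
    [0.0, 0.0, 0.5, 2.5, 0.0],
    [0.5, 0.0, 0.1, 0.1, 2.0],
]])

def top_k_at_mask(logits, input_ids, mask_token_id, k=2):
    """Return (token_id, probability) pairs for the first [MASK] position."""
    # Same boolean-mask trick as above, then take the first masked row.
    mask_logits = logits[0][input_ids == mask_token_id][0]
    # Softmax, shifted by the max for numerical stability.
    exp = np.exp(mask_logits - mask_logits.max())
    probs = exp / exp.sum()
    # Indices of the k largest probabilities, best first.
    top_ids = np.argsort(probs)[::-1][:k]
    return [(int(i), float(probs[i])) for i in top_ids]

for token_id, prob in top_k_at_mask(logits, input_ids, mask_token_id):
    print(vocab[token_id], round(prob, 3))
```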

### English model too

In [55]:
en_model,en_tokenizer = from_pretrained("bert-base-uncased")
masked_language_modeling_en = predict_pipeline(Pipeline(en_model,en_tokenizer))

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForMaskedLM were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [63]:
en_tokenizer.tokenize("we continue to focus on other overrated problems, but tolerate his overdose of amphetamine")

['we',
 'continue',
 'to',
 'focus',
 'on',
 'other',
 'over',
 '##rated',
 'problems',
 'tolerate',
 'his',
 'overdose',
 'of',
 'amp',
 '##het',
 '##amine']
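
The `##` pieces above come from WordPiece tokenization: a word missing from the vocabulary is split into known sub-words, and every piece after the first is marked with `##`. Below is a minimal sketch of the greedy longest-match-first idea, assuming a tiny made-up vocabulary (`toy_vocab`); the `wordpiece` helper is hypothetical and skips details of the real tokenizer such as lower-casing and punctuation handling.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no known piece fits: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical vocabulary, far smaller than the real one.
toy_vocab = {"over", "##rated", "overdose", "amp", "##het", "##amine"}

for word in ["overrated", "overdose", "amphetamine"]:
    print(word, wordpiece(word, toy_vocab))
```

With this vocabulary, "overdose" stays whole because the full word is known, while "overrated" has to be assembled from two pieces.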

In [25]:
masked_language_modeling_en("The rain in the spain stays mainly in the [MASK].")

'summer'

### Model learning human bias from data

In [44]:
masked_language_modeling_en("Chinese people are usually [MASK] . ")

'bilingual'

In [52]:
masked_language_modeling_en("Latino people are [MASK]. ")

'excluded'

In [45]:
masked_language_modeling_en("Gay people are usually [MASK] . ")

'excluded'

In [50]:
masked_language_modeling_en("Black people are [MASK].")

'everywhere'

In [51]:
masked_language_modeling_en("White people are [MASK]. ")

'free'

## Generative model
> Causal language modeling: guessing the next word, like a smart input method

### Papers
* [Improving Language Understanding by Generative Pre-Training (GPT-1)](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)
> Pure and simple causal LM is powerful!

* [Language Models are Unsupervised Multitask Learners (GPT-2)](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
> Downstream tasks framed as causal LM

* [Language Models are Few-Shot Learners (GPT-3)](https://arxiv.org/abs/2005.14165)
> The largest GPT-2 has ```1.5 Bn``` parameters; GPT-3 has ```175.0 Bn```.

$\hat{x_{i}} = f(\{x_{1},x_{2},...,x_{i-1}\}) $
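
As a rough sketch of the formula above, here is a toy next-word predictor that always picks the most frequent follower of the last word (greedy decoding). It is a deliberate simplification: it conditions only on $x_{i-1}$, whereas a GPT-style model conditions on all previous tokens. The corpus and the names `next_counts` and `greedy_generate` are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical miniature corpus.
corpus = (
    "the number of times the number of times "
    "the number of pages the end"
).split()

# Count how often each word follows each other word.
next_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_counts[prev][nxt] += 1

def greedy_generate(prompt, max_length=12):
    """Extend the prompt one word at a time, always taking the top follower."""
    tokens = prompt.split()
    while len(tokens) < max_length:
        candidates = next_counts.get(tokens[-1])
        if not candidates:
            break  # this word was never followed by anything
        tokens.append(candidates.most_common(1)[0][0])
    return " ".join(tokens)

print(greedy_generate("the", max_length=10))
```

Always taking the single most likely continuation can trap the generator in a cycle, which is worth keeping in mind when a greedy generator starts repeating itself.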



In [71]:
## Enter GPT
from transformers import AutoModelForCausalLM

In [72]:
gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer_gpt2 = AutoTokenizer.from_pretrained("gpt2",use_fast=True)

In [100]:
# GPT-2 has no padding token; reuse its end-of-sequence token
gpt2.config.pad_token_id = tokenizer_gpt2.eos_token_id

In [116]:
start_text = "To be or not to be, that is the question "

In [117]:
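def write_sentence(start_text):
    # encode the prompt, keeping at most the last 512 tokens
    x = tokenizer_gpt2(start_text,return_tensors="pt")['input_ids'][:,-512:]
    # continue the prompt until the whole sequence is 100 tokens long
    y_pred = gpt2.generate(x,max_length=100)
    return tokenizer_gpt2.decode(list(y_pred[0]))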

In [119]:
write_sentence("2 times 6 is")


'2 times 6 is the same as the previous one.\n\nThe second time, the number of times the number of times the number of times the number of times the number of times the number of times the number of times the number of times the number of times the number of times the number of times the number of times the number of times the number of times the number of times the number of times the number of times the number of times the number of times the number of times the number of'

In [120]:
write_sentence("The T800 robot in Terminator movie was play by")


'The T800 robot in Terminator movie was play by the same actor who played the character in the original Terminator movie.\n\nThe T800 robot in Terminator movie was play by the same actor who played the character in the original Terminator movie. The T800 robot in Terminator movie was played by the same actor who played the character in the original Terminator movie. The T800 robot in Terminator movie was played by the same actor who played the character in the original Terminator movie. The T800 robot in Terminator'

In [118]:
start_text = write_sentence(start_text)
start_text


'To be or not to be, that is the question \xa0of what is the meaning of the word "fool." \xa0I think that the word "fool" is a very important word in the English language. \xa0It is used to describe a person who is not a fool. \xa0It is used to describe a person who is not a fool. \xa0It is used to describe a person who is not a fool. \xa0It is used to describe a person'

## Other pipelines
* [Attention Is All You Need (Vaswani et al.), the Transformer model](https://arxiv.org/pdf/1706.03762.pdf)
> Video killed the radio star; transformers killed RNN-based models.

In [7]:
from transformers import pipeline

In [13]:
translation = pipeline("translation_en_to_de")

In [15]:
translation("The villainy you teach me I will execute—and it shall go hard but I will better the instruction.")

[{'translation_text': 'Die Bösewicht, die du mir lehrst, werde ich ausführen—und es wird hart gehen, aber ich werde die Anweisung besser machen.'}]

## Zero-shot learning

In [17]:
zero_shot = pipeline("zero-shot-classification")

Some weights of the model checkpoint at facebook/bart-large-mnli were not used when initializing BartForSequenceClassification: ['model.encoder.version', 'model.decoder.version']
- This IS expected if you are initializing BartForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BartForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [25]:
zero_shot("""President Obama is utterly optimistic, 
which he gave full credit to being a Hawaiian when 
he addressed all the white house interns
""",
        candidate_labels = ["happy","black","an intern",
                            "from east coast","from west coast"],
        hypothesis_template="Obama is {}.",
        multi_class=True,
)

{'labels': ['happy',
  'from west coast',
  'an intern',
  'from east coast',
  'black'],
 'scores': [0.9978259801864624,
  0.6224255561828613,
  0.006976888980716467,
  0.0027586454525589943,
  0.0025903750211000443],
 'sequence': 'President Obama is utterly optimistic, \nwhich he gave full credit to being a Hawaiian when \nhe addressed all the white house interns\n'}

In [23]:
zero_shot("""We have the pig who's so lazy and built a straw house, 
and we also have the pig who though wooden house is strong enough, 
but when the big bad wolf came with burning torch and everything, 
only the last pig with the brick house stands, 
and saved all the piggy brothers""",
          candidate_labels = ["1","2","3","4"],
          hypothesis_template="There are {} pigs in total",
          multi_class=False,
          )

{'labels': ['2', '3', '4', '1'],
 'scores': [0.45129668712615967,
  0.2386798858642578,
  0.1752108782529831,
  0.13481254875659943],
 'sequence': "We have the pig who's so lazy and built a straw house, \nand we also have the pig who though wooden house is strong enough, \nbut when the big bad wolf came with burning torch and everything, \nonly the last pig with the brick house stands, \nand saved all the piggy brothers"}
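
The scores in the first call do not sum to 1, while those in the second call do. A hedged sketch of why, based on my understanding of how the zero-shot pipeline used an NLI model at the time: each candidate label is turned into a hypothesis and scored for entailment; with `multi_class=True` each label is judged on its own (entailment against contradiction), and with `multi_class=False` the labels compete in a single softmax. The logits below are made up, and `score_independent` / `score_exclusive` are hypothetical names, not library functions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # shift for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Made-up NLI logits, one row per candidate label;
# columns assumed to be (contradiction, neutral, entailment).
labels = ["happy", "an intern", "from west coast"]
nli_logits = np.array([
    [-3.0, 0.0, 3.0],
    [2.0, 0.5, -2.5],
    [-0.5, 0.2, 0.1],
])

def score_independent(nli_logits):
    """Each label on its own: softmax over (contradiction, entailment)."""
    pair = nli_logits[:, [0, 2]]
    return softmax(pair, axis=1)[:, 1]

def score_exclusive(nli_logits):
    """Labels compete: softmax over the entailment logits of all labels."""
    return softmax(nli_logits[:, 2], axis=0)

for name, scores in [("independent", score_independent(nli_logits)),
                     ("exclusive", score_exclusive(nli_logits))]:
    print(name, dict(zip(labels, scores.round(3))), "sum =", scores.sum().round(3))
```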