<a href="https://colab.research.google.com/github/shuishen112/colab_jupyter/blob/main/gpt_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to GPT

---

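Generative Pre-Training (GPT) consists of two stages. First, learn a high-quality language model on a large corpus. Second, fine-tune that model on discriminative tasks, which is the same idea that BERT follows.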

---
## Unsupervised pre-training

Given a corpus of tokens $U=\{u_1,...,u_n\}$, the language-modeling objective, with context window size $k$ and model parameters $\theta$, is:

$$L_{1}(U)=\sum_i\log p(u_i|u_{i-k},...,u_{i-1};\theta)$$
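The objective can be read as a loop over positions: each token is scored given at most $k$ preceding tokens, and the log-probabilities are summed. The following is a minimal sketch, not the GPT training code; `lm_log_likelihood` and the toy `uniform_prob` model are hypothetical names introduced only for illustration.

```python
import math

def lm_log_likelihood(tokens, cond_prob, k):
    """Sum of log p(u_i | u_{i-k}, ..., u_{i-1}) over all positions i."""
    total = 0.0
    for i, token in enumerate(tokens):
        # The context is the (at most) k tokens that precede position i.
        context = tuple(tokens[max(0, i - k):i])
        total += math.log(cond_prob(token, context))
    return total

# Hypothetical toy model: ignores the context and is uniform over 4 tokens.
def uniform_prob(token, context, vocab_size=4):
    return 1.0 / vocab_size

tokens = [0, 3, 1, 2]
L1 = lm_log_likelihood(tokens, uniform_prob, k=2)
print(L1)  # equals 4 * log(1/4) for this uniform toy model
```

In a real model, `cond_prob` would be the softmax output of the network, and training would maximize this sum with respect to $\theta$.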

The basic building block of GPT is the transformer decoder (note the difference from BERT, which uses the transformer encoder). The forward pass is:

$$h_0=UW_e+W_p$$

$$h_l=transformer\_block(h_{l-1}) \quad \forall l\in [1,n]$$

$$P(u)=softmax(h_n W_e^T)$$

$U$ is the context vector of tokens, $n$ is the number of layers, $W_e$ is the token embedding matrix, and $W_p$ is the position embedding matrix.
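The three equations above can be traced in a few lines of plain Python. This is only a sketch under stated assumptions: `toy_block` is a hypothetical stand-in for a real transformer block, the weights are random, and the sizes (`vocab_size`, `d_model`, `max_len`) are made up. What it does show is the embedding lookup plus position embedding, the stack of $n$ blocks, and the output softmax that reuses $W_e^T$.

```python
import math
import random

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def matmul_transposed(h, W):
    """Compute h @ W^T for h of shape (T, d) and W of shape (V, d)."""
    return [[sum(a * b for a, b in zip(h_row, w_row)) for w_row in W] for h_row in h]

def toy_block(h):
    # Hypothetical placeholder: a real block applies masked self-attention
    # and a feed-forward network. Here it is just an elementwise tanh.
    return [[math.tanh(x) for x in row] for row in h]

def gpt_forward(token_ids, W_e, W_p, n_layers):
    # h_0 = U W_e + W_p: with one-hot rows in U, U W_e is a row lookup in W_e.
    h = [[e + p for e, p in zip(W_e[tok], W_p[pos])]
         for pos, tok in enumerate(token_ids)]
    # h_l = transformer_block(h_{l-1}) for l in [1, n]
    for _ in range(n_layers):
        h = toy_block(h)
    # P(u) = softmax(h_n W_e^T): the output layer reuses the token embeddings.
    return [softmax(row) for row in matmul_transposed(h, W_e)]

random.seed(0)
vocab_size, d_model, max_len = 5, 4, 8  # made-up sizes for the sketch
W_e = [[random.gauss(0, 0.02) for _ in range(d_model)] for _ in range(vocab_size)]
W_p = [[random.gauss(0, 0.02) for _ in range(d_model)] for _ in range(max_len)]

probs = gpt_forward([1, 4, 2], W_e, W_p, n_layers=2)
print(len(probs), len(probs[0]))  # one distribution over the vocabulary per position
```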

## Supervised fine-tuning

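Once unsupervised pre-training of the language model is done, fine-tuning on a labeled dataset $C$ is enough to get strong results. Each instance in $C$ is a sequence of input tokens $x^1,...,x^m$ with a label $y$. The inputs are passed through the pre-trained model to obtain the final transformer block's activation $h_l^m$, and this high-level representation is fed into a linear output layer with parameters $W_y$:

$$P(y|x^1,...,x^m)=softmax(h_l^mW_y)$$

This gives the following objective to maximize: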

$$L_2(C)=\sum\limits_{(x,y)}\log P(y|x^1,...,x^m)$$
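As a rough illustration of $L_2$, the sketch below applies a linear layer and a softmax to the last-position activation of each example and sums the log-probabilities of the true labels. The activations in `examples` and the zero-initialized `W_y` are made-up values, and the function names are hypothetical; in practice $h_l^m$ comes from the pre-trained model.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def class_log_prob(h_last, W_y, y):
    """log P(y | x^1..x^m) = log softmax(h_l^m W_y)[y], with W_y of shape (d, n_classes)."""
    n_classes = len(W_y[0])
    logits = [sum(h_j * W_y[j][c] for j, h_j in enumerate(h_last))
              for c in range(n_classes)]
    return math.log(softmax(logits)[y])

def fine_tune_objective(examples, W_y):
    """L_2(C): sum of log P(y | x) over all labeled (activation, label) pairs."""
    return sum(class_log_prob(h_last, W_y, y) for h_last, y in examples)

# Made-up final activations (d = 3) with binary labels.
examples = [([0.5, -1.0, 2.0], 1), ([1.5, 0.3, -0.7], 0)]
W_y = [[0.0, 0.0] for _ in range(3)]  # zero weights give a uniform prediction

L2 = fine_tune_objective(examples, W_y)
print(L2)  # equals 2 * log(1/2) while W_y is all zeros
```

Training would raise this value by updating $W_y$ together with the pre-trained weights.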

## transformer\_block

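Both the encoder and the decoder of the original transformer contain multi-head attention layers. In the encoder, $Q$, $K$, and $V$ all come from the input tokens. The decoder additionally has an encoder-decoder attention layer, whose query matrix comes from the previous decoder layer and whose key and value matrices come from the encoder output. GPT has no encoder, so its $transformer\_block$ keeps only the masked self-attention layer.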
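In a decoder-style block, self-attention is causal: position $i$ may attend only to positions up to $i$. The sketch below is a single-head, pure-Python version of that idea. As a simplifying assumption it uses the raw inputs as $Q$, $K$, and $V$, whereas a real block first applies learned projections and uses several heads; `causal_self_attention` is a hypothetical name.

```python
import math

def causal_self_attention(Q, K, V):
    """Scaled dot-product attention in which position i sees only positions 0..i."""
    T, d_k = len(Q), len(Q[0])
    outputs, weights = [], []
    for i in range(T):
        # Score position i against itself and earlier positions only.
        scores = [sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d_k)
                  for j in range(i + 1)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        # Future positions get zero weight (the causal mask).
        row = [e / z for e in exps] + [0.0] * (T - i - 1)
        weights.append(row)
        outputs.append([sum(row[j] * V[j][c] for j in range(T))
                        for c in range(len(V[0]))])
    return outputs, weights

# Simplification: use the same toy inputs as queries, keys, and values.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = causal_self_attention(x, x, x)
print(attn[0])  # the first position can only attend to itself
```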







In [4]:
!pip install git+https://github.com/huggingface/transformers
!pip install datasets


In [6]:
from transformers import GPT2Tokenizer, GPT2ForSequenceClassification
import torch

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# GPT-2 has no padding token; reuse the end-of-text token (the string, not its id).
tokenizer.pad_token = tokenizer.eos_token

model = GPT2ForSequenceClassification.from_pretrained('gpt2', return_dict=True)
# The classification head uses pad_token_id to locate the last non-padding token.
model.config.pad_token_id = tokenizer.pad_token_id

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
print(inputs)
labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
outputs = model(**inputs, labels=labels)
loss = outputs.loss      # classification loss computed against `labels`
logits = outputs.logits  # shape (batch_size, num_labels)



Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'input_ids': tensor([[15496,    11,   616,  3290,   318, 13779]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}


In [None]:
from datasets import load_dataset, load_metric
from dataclasses import dataclass,field
from typing import Optional
from transformers import HfArgumentParser
from transformers import TrainingArguments

from transformers import GPT2Tokenizer, GPT2ForSequenceClassification

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# GPT-2 has no padding token; reuse the end-of-text token (the string, not its id).
tokenizer.pad_token = tokenizer.eos_token
model = GPT2ForSequenceClassification.from_pretrained('gpt2', return_dict=True)
model.config.pad_token_id = tokenizer.pad_token_id


print(model.config)
import random

datasets = load_dataset("glue",'mrpc')
padding = "max_length"

max_length = 200


task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

sentence1_key, sentence2_key = task_to_keys["mrpc"]

def preprocess_function(examples):
    # Tokenize the texts
    args = (
        (examples[sentence1_key],) if sentence2_key is None else (examples[sentence1_key], examples[sentence2_key])
    )
    result = tokenizer(*args, padding=padding, max_length=max_length, truncation=True)
    return result

# datasets = datasets.map(preprocess_function,batched = True)

# train_dataset = datasets['train']
# eval_dataset = datasets["validation"]
# # Log a few random samples from the training set:
# for index in random.sample(range(len(train_dataset)), 3):
#     print(f"Sample {index} of the training set: {train_dataset[index]}.")

# metric = load_metric("glue","mrpc")


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "pad_token_id": 50256,
  "resid_pdrop": 0.1,
  "return_dict": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "vocab_size": 50257
}



Reusing dataset glue (/root/.cache/huggingface/datasets/glue/mrpc/1.0.0/7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4)


3668
3668


In [None]:
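import numpy as np
from transformers import EvalPrediction

# `is_regression` and `metric` are not defined in the cells above. The values below
# are assumptions for MRPC, a classification task. Set `metric` to the GLUE metric
# object (see the commented-out load_metric line above) to report the task metrics.
is_regression = False
metric = None

def compute_metrics(p: EvalPrediction):
    # Some models return a tuple; the logits are its first element.
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    # Regression keeps the raw score; classification takes the arg-max class.
    preds = np.squeeze(preds) if is_regression else np.argmax(preds, axis=1)
    if metric is not None:
        result = metric.compute(predictions=preds, references=p.label_ids)
        if len(result) > 1:
            result["combined_score"] = np.mean(list(result.values())).item()
        return result
    elif is_regression:
        return {"mse": ((preds - p.label_ids) ** 2).mean().item()}
    else:
        return {"accuracy": (preds == p.label_ids).astype(np.float32).mean().item()}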