<a href="https://colab.research.google.com/github/royam0820/fastai2-v4/blob/master/GPT2_FR_TextClassification_allocine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# OpenAI GPT-2 Model Info
[Model Info](https://huggingface.co/transformers/model_doc/gpt2.html)

[The Illustrated GPT-2 (Visualizing Transformer Language Models)](http://jalammar.github.io/illustrated-gpt2/#model-output)

GPT-2 is a large transformer-based language model (**1.5 billion parameters** in its largest version), trained on a dataset of 8 million web pages. It is trained with a simple objective: predict the next word, given all of the previous words within some text.
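
To make the training objective concrete, here is a minimal sketch of how one sentence becomes (context, next word) training pairs. Splitting on whitespace is an assumption made for readability only, since GPT-2 itself works on byte-level BPE tokens, and `next_token_pairs` is a hypothetical helper name.

```python
def next_token_pairs(tokens):
    """Return (context, target) pairs: every prefix predicts the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Whitespace split stands in for the real tokenizer (illustration only)
tokens = "Peut-être que vous avez raison".split()
for context, target in next_token_pairs(tokens):
    print(context, "->", target)
```

A sequence of n tokens yields n - 1 such pairs, which is why a language model can learn from raw text without any labels.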

 


# Notebook Info

This notebook fine-tunes a GPT-2 model for text classification on a custom dataset, using the Hugging Face transformers library together with fastai.

The dataset is the AlloCine reviews: French-language film reviews, each labeled with a polarity.

Main idea: start from a GPT-2 model that was pretrained on French text, then fine-tune it on a sub-domain, namely the French film reviews from the AlloCine website.


In [1]:
# !nvidia-smi

In [1]:
! [ -e /content ] && pip install -Uqq fastai  # upgrade fastai on colab

In [2]:
!pip install -Uq transformers
from fastai.text.all import *

In [4]:
# # better display of review text in dataframes
# pd.set_option('display.max_colwidth', None) 

# BelGPT-2 - a GPT-2 model pre-trained on French corpora
This language model was trained on French text from several sources, including Wikipedia, news, and EuroParl. The training corpus is about 60 GB.

https://github.com/antoiloui/belgpt2/blob/master/docs/index.md

https://huggingface.co/antoiloui/belgpt2



In [3]:
from transformers import GPT2Tokenizer, GPT2ForSequenceClassification, GPT2Config

In [4]:
model_name = "antoiloui/belgpt2"
ds_name = ''

max_len = 512
bs = 8
val_bs = bs*2

lr = 2e-5

In [None]:
# # Initializing a GPT2 configuration
# configuration = GPT2Config()

# # Initializing a model from the configuration
# model = GPT2ForSequenceClassification(configuration)

# # Accessing the model configuration
# configuration = model.config; configuration

In [9]:
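# GPT-2 ships without a padding token, so add one
tokenizer = GPT2Tokenizer.from_pretrained(model_name, pad_token='<PAD>')
tokenizer.padding_side = "left"
tokenizer.model_max_length = max_len  # the attribute is `model_max_length`, not `model_max_len`
model = GPT2ForSequenceClassification.from_pretrained(model_name)
# <PAD> is a new token: grow the embedding matrix and tell the model which id is padding
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id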

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Some weights of the model checkpoint at antoiloui/belgpt2 were not used when initializing GPT2ForSequenceClassification: ['lm_head.weight']
- This IS expected if you are initializing GPT2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing GPT2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at antoiloui/belgpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

In [46]:
model.config

GPT2Config {
  "_name_or_path": "antoiloui/belgpt2",
  "_num_labels": 2,
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "output_past": true,
  "resid_pdrop": 0.1,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "transformers_version": "4.6.1",
  "use_cache": true,
  "vocab_size": 50257
}
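
The configuration above is enough to estimate the size of this checkpoint. The sketch below counts parameters from `n_embd`, `n_layer`, `n_positions` and `vocab_size`, assuming the standard GPT-2 block layout (fused query/key/value projection, 4x MLP expansion because `n_inner` is null, a bias on every projection, output embedding tied to the input embedding). It is an estimate derived from the config, not a number reported by the notebook.

```python
# Values copied from the printed GPT2Config
n_embd, n_layer, n_positions, vocab_size = 768, 12, 1024, 50257
n_inner = 4 * n_embd  # GPT-2 default when n_inner is null

def linear(n_in, n_out):
    """Weights plus bias of one projection."""
    return n_in * n_out + n_out

layer_norm = 2 * n_embd  # scale and shift
attention = linear(n_embd, 3 * n_embd) + linear(n_embd, n_embd)  # c_attn + c_proj
mlp = linear(n_embd, n_inner) + linear(n_inner, n_embd)          # c_fc + c_proj
block = 2 * layer_norm + attention + mlp

embeddings = vocab_size * n_embd + n_positions * n_embd  # wte + wpe
backbone = embeddings + n_layer * block + layer_norm     # plus the final layer norm
print(f"{backbone:,} parameters in the transformer backbone")
```

Under these assumptions the backbone comes to roughly 124 million parameters, so this checkpoint corresponds to the smallest GPT-2 configuration rather than the 1.5-billion-parameter one. The classification head adds only `n_embd * 2` weights on top.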

### AlloCine Dataset

In [18]:
#creating a directory allocine
path = Path('/content/allocine/')
path.mkdir(parents=True, exist_ok=True)

In [19]:
# creating a models subdirectory inside allocine
path = Path('/content/allocine/models')
path.mkdir(parents=True, exist_ok=True)

In [20]:
path = Path('/content/allocine/'); path

Path('/content/allocine')

In [11]:
# downloading the AlloCine dataset
!wget -q https://github.com/TheophileBlard/french-sentiment-analysis-with-bert/raw/master/allocine_dataset/data.tar.bz2
!tar -xf /content/data.tar.bz2 -C '/content/allocine'

In [21]:
train_df = pd.read_json(path/'data/train.jsonl', lines=True, nrows=10000)
train_df.head(1)

Unnamed: 0,film-url,review,polarity
0,http://www.allocine.fr/film/fichefilm-135259/critiques/spectateurs,"Si vous cherchez du cinéma abrutissant à tous les étages,n'ayant aucune peur du cliché en castagnettes et moralement douteux,""From Paris with love"" est fait pour vous.Toutes les productions Besson,via sa filière EuropaCorp ont de quoi faire naître la moquerie.Paris y est encore une fois montrée comme une capitale exotique,mais attention si l'on se dirige vers la banlieue,on y trouve tout plein d'intégristes musulmans prêts à faire sauter le caisson d'une ambassadrice américaine.Nauséeux.Alors on se dit qu'on va au moins pouvoir apprécier la déconnade d'un classique buddy-movie avec le jeun...",0


In [22]:
train_df.to_csv(path/'data/train.csv', encoding = 'utf-8', header = True, index = False)
train_df.head(2)

Unnamed: 0,film-url,review,polarity
0,http://www.allocine.fr/film/fichefilm-135259/critiques/spectateurs,"Si vous cherchez du cinéma abrutissant à tous les étages,n'ayant aucune peur du cliché en castagnettes et moralement douteux,""From Paris with love"" est fait pour vous.Toutes les productions Besson,via sa filière EuropaCorp ont de quoi faire naître la moquerie.Paris y est encore une fois montrée comme une capitale exotique,mais attention si l'on se dirige vers la banlieue,on y trouve tout plein d'intégristes musulmans prêts à faire sauter le caisson d'une ambassadrice américaine.Nauséeux.Alors on se dit qu'on va au moins pouvoir apprécier la déconnade d'un classique buddy-movie avec le jeun...",0
1,http://www.allocine.fr/film/fichefilm-172430/critiques/spectateurs,"Trash, re-trash et re-re-trash...! Une horreur sans nom. Imaginez-vous les 20 premières minutes de Orange Mécanique dilatées sur plus de 70 minutes de bande VHS pourrave et revisitées par Korine à la sauce années 2000 : les dandys-punk de Kubrick ont laissé place à des papys lubriques déguisés en sacs-poubelles forniquant les troncs d'arbres, le dispositif esthétique se résume à du filmage-réalité enfilant des scènes de destruction, de soumission, de pornographie ou encore de maltraitance ( youtube, youtube et re-youtube...) et la bande-son se limite à des ricanements malades, des rengaine...",0


In [14]:
# # splitting a df by rows
# df_train = df.iloc[10001:-1]; len(df_train)

## Text Tokenization

In [23]:
tokenizer.vocab_size

50257

In [24]:
tokenizer.pad

<bound method PreTrainedTokenizerBase.pad of PreTrainedTokenizer(name_or_path='antoiloui/belgpt2', vocab_size=50257, model_max_len=1000000000000000019884624838656, is_fast=False, padding_side='left', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'pad_token': '<PAD>'})>

In [25]:
class TransformersTokenizer(Transform):
    def __init__(self, tokenizer): self.tokenizer = tokenizer
    def encodes(self, x): 
        toks = self.tokenizer.tokenize(x)
        return tensor(self.tokenizer.convert_tokens_to_ids(toks))
    def decodes(self, x): return TitledStr(self.tokenizer.decode(x.cpu().numpy()))

In [26]:
# testing the tokenizer 
tokenizer_fastai_fr = TransformersTokenizer(tokenizer)
text = "Peut-être que vous avez raison"
tokens_ids = tokenizer_fastai_fr.encodes(text)
tokens = tokenizer_fastai_fr.tokenizer.convert_ids_to_tokens(tokens_ids)

print('input text:',TitledStr(text))
print('text tokens:',TitledStr(tokens))
print('text tokens_ids:',TitledStr(tokens_ids))
print('output text:',TitledStr(tokenizer_fastai_fr.decodes(tokens_ids)))

input text: Peut-être que vous avez raison
text tokens: ['Peut', '-', 'Ãªtre', 'Ġque', 'Ġvous', 'Ġavez', 'Ġraison']
text tokens_ids: tensor([46906,    15,  1519,   354,   472,  1578,  1835])
output text: Peut-être que vous avez raison
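
The odd-looking tokens `Ãªtre` and `Ġque` above are not encoding errors. GPT-2's byte-level BPE first maps every UTF-8 byte to a printable character, so `ê` (two bytes) shows up as `Ãª` and a leading space shows up as `Ġ`. The sketch below rebuilds that byte-to-character table following the scheme used by the original GPT-2 encoder; `to_visible` is a hypothetical helper, and no BPE merging is performed.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable unicode character."""
    # Bytes that are already printable keep their own character
    keep = list(range(0x21, 0x7F)) + list(range(0xA1, 0xAD)) + list(range(0xAE, 0x100))
    mapping = {b: chr(b) for b in keep}
    # Every other byte (space, control characters, ...) is shifted past U+00FF
    shift = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping

BYTE_TO_CHAR = bytes_to_unicode()

def to_visible(text):
    """Show how a string looks after the byte-level mapping, before any BPE merges."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))

print(to_visible("être"))
print(to_visible(" que"))
```

Decoding reverses the same table, which is why the round trip in the cell above returns the original sentence unchanged.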


## DataLoader

In [27]:
# from transformers import AutoModelForSequenceClassification

In [28]:
from transformers import GPT2ForSequenceClassification

In [29]:
class HFTextBlock(TransformBlock):
    "A `TransformBlock` for texts tokenized with a Hugging Face tokenizer"
    def __init__(self, tokenizer):
        type_tfms = TransformersTokenizer(tokenizer)
        # pad on the same side the tokenizer is configured to pad
        pad_first = tokenizer.padding_side == 'left'
        super().__init__(type_tfms=type_tfms,
                         dl_type=SortedDL,
                         dls_kwargs={'before_batch': Pad_Chunk(pad_idx=tokenizer.pad_token_id, pad_first=pad_first)})
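
With `pad_first=True`, padding is placed before the text, so every sequence in a batch ends at the last position. A minimal plain-Python sketch of that idea, with truncation to a maximum length added: `pad_left` is a hypothetical helper, and the pad id 50257 is an assumption (the first id after the 50257-entry base vocabulary, where an added `<PAD>` token would land).

```python
def pad_left(batch, pad_id, max_len=None):
    """Left-pad token-id lists to a common length, keeping at most the first `max_len` ids of each."""
    if max_len is not None:
        batch = [ids[:max_len] for ids in batch]
    width = max(len(ids) for ids in batch)
    return [[pad_id] * (width - len(ids)) + ids for ids in batch]

PAD_ID = 50257  # assumed id of the added <PAD> token
batch = [[46906, 15, 1519], [354, 472, 1578, 1835, 15, 1519]]
for row in pad_left(batch, PAD_ID, max_len=4):
    print(row)
```

Note that the `TransformersTokenizer` transform in this notebook does not truncate, so long reviews reach the model at full length unless a step like the `max_len` slice above is added.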
        

In [22]:
# bs,sl = 8, 256

In [30]:
dls_clas = DataBlock(
    blocks=(HFTextBlock(tokenizer), CategoryBlock),
    get_x=ColReader('review'),
    get_y=ColReader('polarity'),
    splitter=RandomSplitter(),  # default: a random 20% is held out for validation
).dataloaders(train_df, bs=16)

Could not do one pass in your dataloader, there is something wrong in it


In [31]:
xb,yb=dls_clas.one_batch()

ValueError: ignored

NOTE:  Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.

Using pad_token, but it is not set yet.
Could not do one pass in your dataloader, there is something wrong in it

CPU times: user 10.4 s, sys: 35.5 ms, total: 10.4 s
Wall time: 10.5 s

In [25]:
dls_clas.show_batch(max_n=3)

ValueError: ignored

In [48]:
print(len(dls_clas.train), len(dls_clas.valid))

500 125
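
These counts follow from the setup: 10,000 rows were loaded, `RandomSplitter()` holds out 20% by default, and the batch size is 16. A quick check of the arithmetic, assuming the training loader drops an incomplete last batch and the validation loader keeps it (fastai's usual behavior):

```python
import math

n_rows, valid_pct, bs = 10_000, 0.2, 16

n_valid = int(valid_pct * n_rows)
n_train = n_rows - n_valid

train_batches = n_train // bs            # incomplete last batch dropped
valid_batches = math.ceil(n_valid / bs)  # incomplete last batch kept
print(train_batches, valid_batches)
```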


In [49]:
dls_clas.c, dls_clas.vocab

(2, [0, 1])

## Text Learner

### CUDA out of memory.

In [None]:
# del learn
# del model
# torch.cuda.empty_cache()

In [52]:
def default_splitter(model):
    groups = L(model.base_model.children()) + L(m for m in list(model.children())[1:] if params(m))
    return groups.map(params)

In [59]:
class DropOutput(Callback):
    def after_pred(self): self.learn.pred = self.pred[0]

In [60]:
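model = GPT2ForSequenceClassification.from_pretrained(model_name)
model.resize_token_embeddings(len(tokenizer))  # make room for the added <PAD> token
model.config.pad_token_id = tokenizer.pad_token_id  # required for batches larger than 1
learn = Learner(dls_clas, model, loss_func=CrossEntropyLossFlat(), cbs=[DropOutput], metrics=accuracy).to_fp16()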

Some weights of the model checkpoint at antoiloui/belgpt2 were not used when initializing GPT2ForSequenceClassification: ['lm_head.weight']
- This IS expected if you are initializing GPT2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing GPT2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at antoiloui/belgpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
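
Because `score.weight` is freshly initialized, the classifier starts out close to guessing. A useful sanity check is the cross-entropy of an uninformative two-class prediction, which is ln 2 ≈ 0.693; a loss that stays well above that after the first batches would suggest a problem in the pipeline. A minimal sketch in plain Python (`cross_entropy` here is a hypothetical helper, not the fastai loss):

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

print(cross_entropy([0.0, 0.0], target=1))   # equal logits: ln 2
print(cross_entropy([-2.0, 3.0], target=1))  # confident and correct: small loss
```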


In [57]:
learn.model

GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=Fal

In [44]:
learn.lr_find()

NameError: ignored

In [None]:
learn.fit_one_cycle(1, 1e-5)
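
`fit_one_cycle(1, 1e-5)` treats `1e-5` as the peak learning rate, not a constant one. The sketch below reproduces the shape of the schedule in plain Python, assuming fastai's documented defaults (`div=25`, `div_final=1e5`, `pct_start=0.25`) and cosine interpolation in both phases. `one_cycle_lr` is a hypothetical helper, and 500 steps matches the number of training batches reported earlier.

```python
import math

def cos_interp(start, end, pos):
    """Cosine interpolation from `start` (pos=0) to `end` (pos=1)."""
    return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2

def one_cycle_lr(pos, lr_max, div=25.0, div_final=1e5, pct_start=0.25):
    """Learning rate at training progress `pos` in [0, 1]."""
    if pos < pct_start:
        # warm-up: from lr_max/div up to lr_max
        return cos_interp(lr_max / div, lr_max, pos / pct_start)
    # annealing: from lr_max down to lr_max/div_final
    return cos_interp(lr_max, lr_max / div_final, (pos - pct_start) / (1 - pct_start))

n_steps = 500
schedule = [one_cycle_lr(i / (n_steps - 1), 1e-5) for i in range(n_steps)]
print(f"start={schedule[0]:.1e} peak={max(schedule):.1e} end={schedule[-1]:.1e}")
```

Under these assumptions the rate warms up from `lr_max / 25` during the first quarter of training and then decays to a value near zero, so only a short stretch of the epoch runs at the full `1e-5`.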