<a href="https://colab.research.google.com/github/tchotaneu/Transfert_learning/blob/main/Analyse_sentiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# <!-- TITLE --> Sentiment analysis with Transformers
<!-- DESC --> Using a Transformer to perform sentiment analysis

## Objectives:
 - Fine-tune a transformer to perform sentiment analysis
 - Understand how to use a pre-trained transformer

This task is exactly the same as sentiment analysis with text embeddings, but this time
we will harness the power of transformers. Because pre-training is so computationally expensive,
we will use a pre-trained BERT model from HuggingFace.


## What we will do in this notebook:
* Retrieve the dataset
* Prepare the dataset
* Retrieve a pre-trained BERT model from the HuggingFace platform (https://huggingface.co/models)
* Fine-tune the model on a sequence classification task: sentiment analysis of the IMDB dataset.
* Evaluate the result


## Installations

**IMPORTANT:** We will need the `transformers` library created by HuggingFace.

In [1]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

## Imports and initialization

In [2]:
import numpy as np

import tensorflow as tf
import tensorflow.keras as keras
import tensorflow.keras.datasets.imdb as imdb
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.metrics import SparseCategoricalAccuracy

from transformers import (
    DistilBertTokenizer,
    TFDistilBertModel,
    DataCollatorWithPadding,
    BertTokenizer,
    TFBertModel
)

from tqdm.notebook import tqdm
import itertools
import multiprocessing
import os
import matplotlib.pyplot as plt
import seaborn as sns

print("Tensorflow ", tf.__version__)
n_gpus = len(tf.config.list_physical_devices('GPU'))
print("#GPUs: ", n_gpus)
if n_gpus > 0:
    !nvidia-smi -L
os.environ["TOKENIZERS_PARALLELISM"] = "true"

np.random.seed(987654321)
tf.random.set_seed(987654321)

Tensorflow  2.12.0
#GPUs:  1
GPU 0: Tesla T4 (UUID: GPU-ed62237c-c1cf-d11a-5702-c6c8c0239fe2)


## Parameters

* `vocab_size` is the number of words kept in our vocabulary.
* `hide_most_frequently` is the number of words ignored among the most frequent ones.
* `review_len` is the maximum length of a review, in tokens; longer reviews are truncated during tokenization.
* `n_cpus` is the number of CPUs that will be used for data preprocessing.
* `distil` indicates whether we use a DistilBert model or a regular Bert model.

In [3]:
vocab_size = 30000
hide_most_frequently = 0

review_len = 512

epochs = 1
batch_size = 32

fit_verbosity = 1
scale = 1

n_cpus = 1
distil = True

## Retrieve the dataset

In [4]:
(x_train, y_train), (x_test, y_test) = imdb.load_data(
    num_words=vocab_size,
    skip_top=hide_most_frequently,
    seed=123456789,
)


y_train = np.asarray(y_train).astype('float32')
y_test  = np.asarray(y_test ).astype('float32')

n1 = int(scale * len(x_train))
n2 = int(scale * len(x_test))
x_train, y_train = x_train[:n1], y_train[:n1]
x_test,  y_test  = x_test[:n2],  y_test[:n2]

print("x_train : {}  y_train : {}".format(x_train.shape, y_train.shape))
print("x_test  : {}  y_test  : {}".format(x_test.shape,  y_test.shape))
print('\nReview sample (x_train[12]) :\n\n',x_train[12])

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
x_train : (25000,)  y_train : (25000,)
x_test  : (25000,)  y_test  : (25000,)

Review sample (x_train[12]) :

 [1, 13, 124, 4, 84, 5, 13, 122, 49, 7, 4, 748, 5, 2177, 1592, 5, 4, 123, 9, 527, 36, 26, 1026, 117, 362, 37, 92, 28, 101, 676, 5, 242, 43, 11595, 1851, 8, 1779, 98, 2365, 47, 256, 4, 9397, 18, 31, 2, 207, 256, 18, 470, 300, 241, 4, 20, 9, 394, 5, 38, 9, 4, 123, 14, 9, 4, 24370, 91, 1849, 56, 212, 15, 60, 2, 163, 207, 126, 110, 12, 9, 38, 379, 12, 166, 72, 181, 8, 19361, 12, 9, 43, 38, 932, 15, 14002, 62, 126, 1779, 142, 40, 14, 5, 38, 995, 36, 26, 12373, 379, 5, 916, 13, 784, 98, 5, 68, 123, 5, 104, 280, 1851, 2503, 89, 379, 12, 9, 36, 80, 91, 2363, 193, 12, 125]


In [5]:
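word_index = imdb.get_word_index()

# Shift every index by 3: indices 0, 1 and 2 are reserved for padding, start-of-review and unknown words.
word_index = {w:(i+3) for w,i in word_index.items()}
word_index.update({'[PAD]':0, '[CLS]':1, '[UNK]':2})
index_word = {index:word for word,index in word_index.items()}

# Decode a review (a list of word indices) back into text, skipping the leading start marker:
def dataset2text(review):
    return ' '.join([index_word.get(i, "?") for i in review[1:]])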

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json


In [None]:
print(dataset2text(x_train[12]))
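
The index shift used by `dataset2text` can be checked on a toy example. The sketch below is only an illustration: `toy_word_index`, `toy_index_word` and `toy_dataset2text` are hypothetical stand-ins for the notebook's `word_index`, `index_word` and `dataset2text`, and the three-word vocabulary is made up. It assumes the Keras IMDB defaults, where index 1 marks the start of a review and index 2 replaces out-of-vocabulary words.

```python
# Toy stand-in for imdb.get_word_index(): word -> frequency rank (1 = most frequent).
toy_word_index = {"the": 1, "movie": 2, "great": 3}

# Same shift as in the notebook: ranks move up by 3 to free indices 0, 1 and 2.
shifted = {w: i + 3 for w, i in toy_word_index.items()}
shifted.update({"[PAD]": 0, "[CLS]": 1, "[UNK]": 2})
toy_index_word = {i: w for w, i in shifted.items()}


def toy_dataset2text(review):
    # Skip the leading start marker (index 1), as dataset2text does.
    return " ".join(toy_index_word.get(i, "?") for i in review[1:])


# 1 = start marker, 4 = "the", 5 = "movie", 2 = unknown word, 6 = "great"
encoded = [1, 4, 5, 2, 6]
decoded = toy_dataset2text(encoded)
print(decoded)  # the movie [UNK] great
```

Rebuilding plain text this way is what lets the BERT tokenizer re-encode each review with its own vocabulary instead of the Keras word indices.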

## Retrieve the model from HuggingFace

In [6]:
def load_model(distil):
    if distil:
        bert_model = TFDistilBertModel.from_pretrained("distilbert-base-uncased")
        tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
    else:
        bert_model = TFBertModel.from_pretrained("bert-base-uncased")
        tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    return bert_model, tokenizer

bert_model, tokenizer = load_model(distil)
bert_model.summary()

Downloading (…)lve/main/config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertModel: ['vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing TFDistilBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFDistilBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Model: "tf_distil_bert_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMai  multiple                 66362880  
 nLayer)                                                         
                                                                 
Total params: 66,362,880
Trainable params: 66,362,880
Non-trainable params: 0
_________________________________________________________________


## Prepare the dataset

In [None]:
def tokenize_sample(sample):
    return tokenizer(dataset2text(sample), truncation=True, max_length=review_len)

def distributed_tokenize_dataset(dataset):
    ds = list(dataset)
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        tokenized_ds = list(tqdm(
            pool.imap(tokenize_sample, ds),
            total=len(ds)
        ))
    return tokenized_ds

tokenized_x_train = distributed_tokenize_dataset(x_train)
tokenized_x_test = distributed_tokenize_dataset(x_test)


In [None]:
data_collator = DataCollatorWithPadding(tokenizer, return_tensors="tf")

In [None]:
data_collator(tokenized_x_train)
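
To make clearer what the collator returns, here is a minimal pure-Python sketch of padding to the longest sequence, assuming a padding id of 0. `pad_batch` is a hypothetical helper written for illustration, not the `transformers` implementation, and the token ids are made-up examples.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one and build the matching attention masks."""
    max_len = max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for s in sequences:
        n_pad = max_len - len(s)
        input_ids.append(list(s) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(s) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 2023, 102], [101, 2023, 3185, 2001, 102]])
print(batch["input_ids"])       # [[101, 2023, 102, 0, 0], [101, 2023, 3185, 2001, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The attention mask is what allows the model to process rectangular batches without letting padding positions influence the result.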

In [None]:
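def make_dataset(x, y):
    # Pad every tokenized review to a common length and build the attention masks.
    collated = data_collator(x)
    dataset = tf.data.Dataset.from_tensor_slices(
        (collated['input_ids'], collated['attention_mask'], y)
    )
    transformed_dataset = (
        dataset
        # Regroup each element as ((input_ids, attention_mask), label), the layout Keras expects.
        .map(
            lambda input_ids, attention_mask, label: ((input_ids, attention_mask), label)
        )
        .shuffle(25000)
        .batch(batch_size)
    )
    return transformed_dataset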

train_ds = make_dataset(tokenized_x_train, y_train)
test_ds = make_dataset(tokenized_x_test, y_test)

for x, y in train_ds:
    print(x)
    break

## Add a new head to the model

In [None]:
class ClassificationModel(keras.Model):

    def __init__(self, bert_model):
        super(ClassificationModel, self).__init__()
        self.bert_model = bert_model
        self.pre_classifier = Dense(768, activation='relu')
        self.dropout = Dropout(0.1)
        self.classifier = Dense(2)

    def call(self, x):
        x = self.bert_model(x)
        x = x.last_hidden_state
        x = x[:, 0] # get the output of the classification token
        x = self.pre_classifier(x)
        x = self.dropout(x)
        x = self.classifier(x)
        return x

In [None]:
model = ClassificationModel(bert_model)
x = next(iter(train_ds))[0]
model(x)
model.summary()
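
The slice `x[:, 0]` in `call` keeps only the hidden vector at position 0 of each review, which is where the classification token sits. A minimal sketch of that selection with plain nested lists, using made-up values and toy dimensions (the real hidden size is 768):

```python
# Toy hidden states with shape (batch=2, seq_len=3, hidden=2).
last_hidden_state = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # review 0
    [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6]],  # review 1
]

# Equivalent of `x[:, 0]` on a tensor: keep position 0 of every sequence.
cls_vectors = [sequence[0] for sequence in last_hidden_state]
print(cls_vectors)  # [[0.1, 0.2], [1.1, 1.2]]
```

The result has shape (batch, hidden): one vector per review, which the dense layers of the head then map to two logits.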

## Train!

In [None]:
model.compile(
    optimizer=Adam(1e-05),
    loss=SparseCategoricalCrossentropy(from_logits=True),
    metrics=[SparseCategoricalAccuracy('accuracy')]
)
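
Because the head ends with `Dense(2)` and no activation, the model outputs raw logits, which is why the loss is built with `from_logits=True`. As a rough sketch of what that loss computes for a single review (the function name is hypothetical and this is not the Keras implementation), the value is the negative log of the softmax probability assigned to the true class:

```python
import math


def sparse_categorical_crossentropy_from_logits(logits, label):
    """Return -log(softmax(logits)[label]) for one sample."""
    # Numerically stable log-sum-exp: subtract the max before exponentiating.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    # Labels are stored as floats in the notebook, hence the int() cast.
    return log_sum_exp - logits[int(label)]


uniform_loss = sparse_categorical_crossentropy_from_logits([0.0, 0.0], 0)
confident_loss = sparse_categorical_crossentropy_from_logits([4.0, -4.0], 0)
wrong_loss = sparse_categorical_crossentropy_from_logits([4.0, -4.0], 1)
print(uniform_loss, confident_loss, wrong_loss)
```

Equal logits give a loss of log(2), a confident correct prediction gives a loss close to 0, and a confident wrong prediction is penalized heavily.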

In [None]:
history = model.fit(
    train_ds,
    epochs=epochs,
    verbose=fit_verbosity
)

## Evaluation

In [None]:
_, score = model.evaluate(test_ds)
colors = sns.color_palette('pastel')[2:]
accuracy_score = [score, 1 - score]
plt.pie(
    accuracy_score,
    labels=["Accurate", "Mistaken"],
    colors=colors,
    autopct=lambda val: f"{val:.2f}%",
    explode=(0.0, 0.1)
)
plt.show()
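
The `score` plotted here is the sparse categorical accuracy: the fraction of reviews whose largest logit matches the label. A minimal sketch of that computation on made-up logits and labels (`accuracy_from_logits` is a hypothetical helper, not the Keras metric itself):

```python
def accuracy_from_logits(logits_batch, labels):
    """Fraction of samples whose highest logit matches the label."""
    correct = 0
    for logits, label in zip(logits_batch, labels):
        # argmax over the class dimension
        predicted = max(range(len(logits)), key=lambda k: logits[k])
        correct += int(predicted == int(label))
    return correct / len(labels)


toy_logits = [[2.0, -1.0], [0.3, 0.9], [-0.5, 0.1], [1.2, 1.0]]
toy_labels = [0.0, 1.0, 0.0, 0.0]
toy_accuracy = accuracy_from_logits(toy_logits, toy_labels)
print(toy_accuracy)  # 0.75
```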