# Sentiment classification with (S)BERT
This notebook applies BERT and SentenceBERT to a sentiment analysis use case. The (S)BERT versions used here are dated; the goal is to show how they can be put to work.

## Imports
Import the frameworks and the data.

In [None]:
!pip install -U sentence-transformers
!pip install tensorflow_text

# Download the data
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar xvzf aclImdb_v1.tar.gz

In [None]:
# import
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

import os

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

from sentence_transformers import SentenceTransformer

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text

from transformers import BertTokenizer, TFBertModel

## Building the DataFrame

In [None]:
# Read every review file found in a directory
def fetch_reviews(path):
  data = []
  # e.g. path = 'aclImdb/train/pos/'
  for file in os.listdir(path):
    # os.path.join works whether or not path ends with a slash
    with open(os.path.join(path, file), "r", encoding='utf8') as f:
      data.append(f.read())
  return data
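
To check the reading logic without downloading the dataset, the sketch below runs it on a temporary directory. The file names and review texts are made up for illustration, and `read_reviews` is a hypothetical variant that sorts the listing (an addition, since `os.listdir` returns entries in arbitrary order) so that the result is deterministic.

```python
import os
import tempfile

def read_reviews(path):
    """Return the text of every file in `path`, in sorted file-name order."""
    reviews = []
    for name in sorted(os.listdir(path)):
        with open(os.path.join(path, name), "r", encoding="utf8") as f:
            reviews.append(f.read())
    return reviews

# Hypothetical files, only for this demonstration
with tempfile.TemporaryDirectory() as tmp:
    for name, content in [("0_9.txt", "Great movie."), ("1_2.txt", "Terrible plot.")]:
        with open(os.path.join(tmp, name), "w", encoding="utf8") as f:
            f.write(content)
    demo_reviews = read_reviews(tmp)

print(demo_reviews)
```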

In [None]:
# Build the DataFrames
df_train_pos = pd.DataFrame({'review': fetch_reviews('aclImdb/train/pos/'), 'label': 1})
df_train_neg = pd.DataFrame({'review': fetch_reviews('aclImdb/train/neg/'), 'label': 0})

df_test_pos = pd.DataFrame({'review': fetch_reviews('aclImdb/test/pos/'), 'label': 1})
df_test_neg = pd.DataFrame({'review': fetch_reviews('aclImdb/test/neg/'), 'label': 0})

# Merging all df's for data cleaning and preprocessing step.
df = pd.concat([df_train_pos, df_train_neg, df_test_pos, df_test_neg], ignore_index=True)
print("Total reviews in df: ", df.shape)
df.head()

Total reviews in df:  (50000, 2)


Unnamed: 0,review,label
0,What ever happened to Michael Keaton? What a g...,1
1,Although time has revealed how some of the eff...,1
2,"Ok, so it may not be the award-winning ""movie ...",1
3,As a former 2 time Okinawan Karate world champ...,1
4,The only time I have seen this movie was when ...,1


In [None]:
# Shrink the original DataFrame by a factor
facteur = 100
n = len(df) // 2 // facteur  # rows kept per class: the first rows of df are positive reviews, the last rows negative
df = pd.concat([df.head(n), df.tail(n)], ignore_index=True)  # pd.concat keeps the integer dtype of 'label'
df

Unnamed: 0,review,label
0,What ever happened to Michael Keaton? What a g...,1
1,Although time has revealed how some of the eff...,1
2,"Ok, so it may not be the award-winning ""movie ...",1
3,As a former 2 time Okinawan Karate world champ...,1
4,The only time I have seen this movie was when ...,1
...,...,...
495,Rented this tonite from my local video store. ...,0
496,This film and the 1st AvP film both all over t...,0
497,"I am a fifth grade language arts teacher, and ...",0
498,"As a fan of C.J.'s earlier movie, Latter Days,...",0
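
A detail worth checking when keeping the first and last rows of a frame: Python's floor division rounds toward negative infinity, so `-50000 // 2 // facteur` is the exact negative of `50000 // 2 // facteur` only when the factor divides 25000 evenly. The sketch below uses plain integers, with a hypothetical factor of 3 for contrast; `slice_sizes` is a helper invented for this illustration.

```python
total = 50000

def slice_sizes(factor):
    """Sizes of the head and tail slices produced by the two floor divisions."""
    head = total // 2 // factor          # rows taken from the top
    tail = -(-total // 2 // factor)      # rows taken from the bottom (negative index, sign flipped)
    return head, tail

even = slice_sizes(100)   # 100 divides 25000
uneven = slice_sizes(3)   # 3 does not divide 25000

print(even, uneven)
```

Computing the count once and reusing it for both ends avoids the off-by-one between the two classes.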


In [None]:
# train_test_split
data = df.copy()
y = list(data['label'])
data.drop(['label'], axis=1, inplace=True)

X_train, X_test, y_train, y_test = train_test_split(data, y, test_size=0.3, stratify=y)

print("Train data:",  X_train.shape, len(y_train))
print("Test data:",  X_test.shape, len(y_test))

Train data: (350, 1) 350
Test data: (150, 1) 150


## DataFrame

In [None]:
df

Unnamed: 0,review,label
0,What ever happened to Michael Keaton? What a g...,1
1,Although time has revealed how some of the eff...,1
2,"Ok, so it may not be the award-winning ""movie ...",1
3,As a former 2 time Okinawan Karate world champ...,1
4,The only time I have seen this movie was when ...,1
...,...,...
495,Rented this tonite from my local video store. ...,0
496,This film and the 1st AvP film both all over t...,0
497,"I am a fifth grade language arts teacher, and ...",0
498,"As a fan of C.J.'s earlier movie, Latter Days,...",0


## SentenceBERT
https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

Thanks to `SentenceTransformer`, a model hosted on Hugging Face can be downloaded easily.



In [None]:
# Load the sentence_transformers model from Hugging Face
model = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
# Each review is encoded (embedded) with model.encode()
X_train_sbert = model.encode(list(X_train.review))
X_test_sbert = model.encode(list(X_test.review))

In [None]:
# The training features X are now a NumPy array containing only floats
print("Train data:", X_train_sbert.shape, len(y_train))
print("Test data:",   X_test_sbert.shape, len(y_test))

Train data: (350, 384) 350
Test data: (150, 384) 150
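
The 384 columns are the dimensions of the sentence embedding. According to its model card, `all-MiniLM-L6-v2` obtains that vector by averaging the token embeddings (ignoring padding) and normalizing the result. The NumPy sketch below illustrates that pooling step on made-up numbers; the function name, the array names and the tiny sizes are assumptions for illustration, not the library's internals.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors over real tokens only, then L2-normalize."""
    mask = attention_mask[:, :, None].astype(float)     # (batch, tokens, 1)
    summed = (token_embeddings * mask).sum(axis=1)       # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)       # avoid division by zero
    pooled = summed / counts
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

# Toy batch: 2 sentences, 3 token slots, 4 dimensions (384 in the real model)
tokens = np.array([
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [9.0, 9.0, 9.0, 9.0]],  # last slot is padding
    [[0.0, 0.0, 2.0, 0.0], [0.0, 0.0, 2.0, 0.0], [0.0, 0.0, 2.0, 0.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

sentence_vectors = masked_mean_pool(tokens, mask)
print(sentence_vectors.shape)
```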


In [None]:
# Use a classical algorithm for binary classification
clf_sbert = LogisticRegression(penalty='l2', max_iter=1000)
clf_sbert.fit(X_train_sbert, list(y_train))

y_pred = clf_sbert.predict(X_test_sbert) #prediction from model
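
For a binary problem, a fitted logistic regression scores each embedding with a linear function and passes the score through a sigmoid; predicting class 1 corresponds to a probability above 0.5. A minimal NumPy sketch of that decision rule follows, with hypothetical weights standing in for the fitted `coef_` and `intercept_`.

```python
import numpy as np

def predict_binary(X, weights, bias):
    """Return (probability of class 1, predicted label) for each row of X."""
    scores = X @ weights + bias              # linear decision function
    proba = 1.0 / (1.0 + np.exp(-scores))    # sigmoid
    return proba, (proba > 0.5).astype(int)

# Hypothetical values, not taken from the fitted classifier
weights = np.array([2.0, -1.0])
bias = 0.0
X_demo = np.array([[1.0, 0.0], [0.0, 3.0]])

proba, labels = predict_binary(X_demo, weights, bias)
print(labels)
```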

In [None]:
# Classification metrics report for SBERT
print(classification_report(list(y_test), y_pred))

              precision    recall  f1-score   support

           0       0.80      0.76      0.78        75
           1       0.77      0.81      0.79        75

    accuracy                           0.79       150
   macro avg       0.79      0.79      0.79       150
weighted avg       0.79      0.79      0.79       150
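
Each figure in such a report derives from four counts per class: true positives, false positives, false negatives and true negatives. The sketch below recomputes them for class 1 from hypothetical counts that were chosen to be consistent with the rounded values above; they are an assumption for illustration, not the confusion matrix actually produced by the notebook.

```python
def binary_report(tp, fp, fn, tn):
    """Precision, recall and F1 for the positive class, plus overall accuracy."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Hypothetical counts for class 1 (support 75 per class, 150 reviews in total)
precision, recall, f1, accuracy = binary_report(tp=61, fp=18, fn=14, tn=57)
print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
```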



## BERT
https://www.analyticsvidhya.com/blog/2021/12/text-classification-using-bert-and-tensorflow/

In [None]:
# Load the model with tensorflow_hub
bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")

In [None]:
# Function that returns the pooled embedding of each text in a list
def get_sentence_embeding(sentences):
    preprocessed_text = bert_preprocess(sentences)
    return bert_encoder(preprocessed_text)['pooled_output']
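
In the original BERT architecture, the pooled output is the final hidden state of the first (`[CLS]`) token passed through a dense layer with a tanh activation. The sketch below reproduces that computation with NumPy on random numbers; the weights, the small hidden size and the variable names are placeholders, not values from the TF Hub model.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 8   # 768 for the BERT-base encoder
seq_len = 5

# Stand-in for the encoder's per-token output for one text: (tokens, hidden)
sequence_output = rng.normal(size=(seq_len, hidden_size))

# Placeholder parameters of the pooling layer
W = rng.normal(size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

cls_vector = sequence_output[0]              # hidden state of the first token
pooled = np.tanh(cls_vector @ W + b)         # dense layer followed by tanh
print(pooled.shape)
```

Because of the tanh, every component of the pooled vector lies between -1 and 1.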

In [None]:
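# Each review is encoded (embedded) with get_sentence_embeding
# Iterate directly over the reviews; the loop variable is named `review`
# so that it does not shadow the tensorflow_text import (`text`)
X_train_b = []
for review in X_train.review:
  # .numpy() converts the TensorFlow tensor to an array scikit-learn can use
  X_train_b.append(get_sentence_embeding([review])[0].numpy())

X_test_b = []
for review in X_test.review:
  X_test_b.append(get_sentence_embeding([review])[0].numpy())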
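
Calling the encoder once per review is simple but slow. Since the embedding function accepts a list of texts, one option is to send several reviews per call. The sketch below shows the batching logic only; `batched`, `embed_in_batches` and `fake_embed` are names invented for this example, `fake_embed` is a stand-in so the block runs without TensorFlow, and the batch size of 2 is arbitrary.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(texts, embed_fn, batch_size):
    """Apply `embed_fn` to each batch and collect one vector per text."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors

def fake_embed(batch):
    # Hypothetical encoder: maps each text to a 1-dimensional "embedding"
    return [[float(len(t))] for t in batch]

demo_texts = ["good", "bad", "so-so", "great fun", "meh"]
demo_vectors = embed_in_batches(demo_texts, fake_embed, batch_size=2)
print(demo_vectors)
```

With the notebook's encoder, `embed_fn` would presumably wrap `get_sentence_embeding` and convert its result to NumPy; that wiring is an assumption and is not tested here.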

In [None]:
# Logistic regression (separate variable so the SBERT classifier is not overwritten)
clf_bert = LogisticRegression(penalty='l2', max_iter=1000)
clf_bert.fit(X_train_b, list(y_train))

y_pred = clf_bert.predict(X_test_b)

In [None]:
# Classification metrics report for BERT
print(classification_report(list(y_test), y_pred))

              precision    recall  f1-score   support

           0       0.67      0.65      0.66        75
           1       0.66      0.68      0.67        75

    accuracy                           0.67       150
   macro avg       0.67      0.67      0.67       150
weighted avg       0.67      0.67      0.67       150

