# Sentiment Analysis with Deep Learning using BERT

### Project Outline

**Task 1**: Introduction (explain the difference between BERT/CamemBERT and TF-IDF)

**Task 2**: Exploratory data analysis and preprocessing

**Task 3**: Training/Validation Split

**Task 4**: Loading the tokenizer and encoding our data

**Task 5**: Training a model

**Task 6**: Document classification with multinomial logistic regression

**Task 7**: Evaluation on the validation set

**Task 8**: Testing Random Forest, SVM, XGBoost, LightGBM, and Stacking

## Task 1: Introduction (explain the difference between BERT/CamemBERT and TF-IDF)

### What is BERT

BERT is a large-scale transformer-based language model that can be fine-tuned for a variety of tasks.

For more information, the original paper can be found [here](https://arxiv.org/abs/1810.04805).

[HuggingFace documentation](https://huggingface.co/transformers/model_doc/bert.html)

[Bert documentation](https://characters.fandom.com/wiki/Bert_(Sesame_Street)) ;)

<img src="BERT_diagrams.pdf" width="1000">
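
The task title asks for a contrast with TF-IDF. As a rough summary: TF-IDF assigns each word a single count-based weight per document, whereas BERT-style models such as CamemBERT produce vectors that depend on the surrounding words. The sketch below computes TF-IDF by hand on two toy sentences. The sentences and the `tf_idf` helper are illustrative assumptions, and the formula is the plain tf × log(N/df) variant rather than the smoothed one some libraries use.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Toy sentences, for illustration only
docs = ["the museum is beautiful", "the queue is long"]
weights = tf_idf(docs)

# Words present in every document ("the", "is") get weight 0, and a word's
# weight ignores word order and neighbours entirely.
print(weights[0])
```

A contextual model would instead give "beautiful" a different vector in each sentence it appears in, which is what the rest of this notebook relies on.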

## Task 2: Exploratory data analysis and preprocessing

We will use the SMILE Twitter dataset.

_Wang, Bo; Tsakalidis, Adam; Liakata, Maria; Zubiaga, Arkaitz; Procter, Rob; Jensen, Eric (2016): SMILE Twitter Emotion dataset. figshare. Dataset. https://doi.org/10.6084/m9.figshare.3187909.v2_

In [None]:
import torch
import pandas as pd
from tqdm.notebook import tqdm
import numpy as np

In [None]:
# Mount Google Drive first so that the dataset path below is reachable
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

In [None]:
df = pd.read_csv(
    '/content/drive/MyDrive/BERT_Sentiment_Analysis_CyTech/smile-annotations-final.csv',
    names=['id', 'text', 'category']
)
df.set_index('id', inplace=True)


In [None]:
df.head()

id,text,category
611857364396965889,@aandraous @britishmuseum @AndrewsAntonio Merc...,nocode
614484565059596288,Dorian Gray with Rainbow Scarf #LoveWins (from...,happy
614746522043973632,@SelectShowcase @Tate_StIves ... Replace with ...,happy
614877582664835073,@Sofabsports thank you for following me back. ...,happy
611932373039644672,@britishmuseum @TudorHistory What a beautiful ...,happy


In [None]:
df['category'].value_counts()

nocode               1572
happy                1137
not-relevant          214
angry                  57
surprise               35
sad                    32
happy|surprise         11
happy|sad               9
disgust|angry           7
disgust                 6
sad|disgust             2
sad|angry               2
sad|disgust|angry       1
Name: category, dtype: int64

In [None]:
df['text'].iloc[0] # Look at the first tweet

'@aandraous @britishmuseum @AndrewsAntonio Merci pour le partage! @openwinemap'

In [None]:
# Drop every row whose category contains '|' (tweets annotated with more than one emotion)
df = df[~df['category'].str.contains('|', regex=False)]

In [None]:
# Drop rows labelled 'nocode'; copy so that new columns can be added without pandas warnings
df = df[df['category'] != 'nocode'].copy()

In [None]:
df['category'].value_counts()

happy           1137
not-relevant     214
angry             57
surprise          35
sad               32
disgust            6
Name: category, dtype: int64
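
The effect of the two filters above can be checked against the counts printed earlier. This is a small sanity check in plain Python; the `raw_counts` dict simply copies the first `value_counts()` output and is not part of the original notebook.

```python
# Category counts copied from the first value_counts() output
raw_counts = {
    "nocode": 1572, "happy": 1137, "not-relevant": 214, "angry": 57,
    "surprise": 35, "sad": 32, "happy|surprise": 11, "happy|sad": 9,
    "disgust|angry": 7, "disgust": 6, "sad|disgust": 2, "sad|angry": 2,
    "sad|disgust|angry": 1,
}

# Keep single-emotion categories only, then drop the 'nocode' placeholder
kept = {
    category: count
    for category, count in raw_counts.items()
    if "|" not in category and category != "nocode"
}

print(kept)
print("rows kept:", sum(kept.values()))
```

The six remaining categories match the second `value_counts()` output, and their total equals the 1258 + 223 rows that appear later in the training and validation arrays.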

In [None]:
df['category'].unique()

array(['happy', 'not-relevant', 'angry', 'disgust', 'sad', 'surprise'],
      dtype=object)

In [None]:
possible_labels = df['category'].unique()

In [None]:
label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index

In [None]:
label_dict

{'happy': 0,
 'not-relevant': 1,
 'angry': 2,
 'disgust': 3,
 'sad': 4,
 'surprise': 5}
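
The loop above could also be written as a dict comprehension, and the inverse mapping (needed later to turn predicted ids back into category names) follows the same pattern. A minimal sketch, assuming the label order shown in the `unique()` output:

```python
# Label order copied from the df['category'].unique() output
possible_labels = ["happy", "not-relevant", "angry", "disgust", "sad", "surprise"]

# Category name -> integer id
label_dict = {label: index for index, label in enumerate(possible_labels)}

# Integer id -> category name
label_dict_inverse = {index: label for label, index in label_dict.items()}

print(label_dict)
print(label_dict_inverse)
```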

In [None]:
df['label'] = df['category'].map(label_dict)
df.head(10)

id,text,category,label
614484565059596288,Dorian Gray with Rainbow Scarf #LoveWins (from...,happy,0
614746522043973632,@SelectShowcase @Tate_StIves ... Replace with ...,happy,0
614877582664835073,@Sofabsports thank you for following me back. ...,happy,0
611932373039644672,@britishmuseum @TudorHistory What a beautiful ...,happy,0
611570404268883969,@NationalGallery @ThePoldarkian I have always ...,happy,0
614499696015503361,Lucky @FitzMuseum_UK! Good luck @MirandaStearn...,happy,0
613601881441570816,Yr 9 art students are off to the @britishmuseu...,happy,0
613696526297210880,@RAMMuseum Please vote for us as @sainsbury #s...,not-relevant,1
610746718641102848,#AskTheGallery Have you got plans to privatise...,not-relevant,1
612648200588038144,@BarbyWT @britishmuseum so beautiful,happy,0


## Task 3: Training/Validation Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_val, y_train, y_val = train_test_split(
    df.index.values,
    df.label.values,
    test_size = 0.15,
    random_state=17, # For reproducible analyses/results
    stratify = df['label']
)

In [None]:
df['data_type'] = ['not_set']*df.shape[0] # Placeholder column marking each row as train or validation

In [None]:
df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'

In [None]:
df.groupby(['category', 'label', 'data_type']).count()

category,label,data_type,text
angry,2,train,48
angry,2,val,9
disgust,3,train,5
disgust,3,val,1
happy,0,train,966
happy,0,val,171
not-relevant,1,train,182
not-relevant,1,val,32
sad,4,train,27
sad,4,val,5
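
To see why stratification matters with classes this small, the per-class split can be approximated by hand. The sketch below rounds 15% of each class count; this is only an approximation of what a stratified split does (scikit-learn's exact allocation procedure differs in its details), and the counts are copied from the `value_counts()` output in Task 2.

```python
# Class counts after filtering, copied from the value_counts() output
full_counts = {
    "happy": 1137, "not-relevant": 214, "angry": 57,
    "surprise": 35, "sad": 32, "disgust": 6,
}
test_size = 0.15

# Approximate stratification: each class contributes about 15% of its rows
val_counts = {c: round(n * test_size) for c, n in full_counts.items()}
train_counts = {c: n - val_counts[c] for c, n in full_counts.items()}

for c in full_counts:
    print(f"{c:>12}: train={train_counts[c]:4d}  val={val_counts[c]:3d}")
```

Without stratification, a class with only six rows such as `disgust` could easily end up with no validation example at all.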


## Task 4: Loading the tokenizer and encoding our data

In [None]:

!apt-get install cmake build-essential pkg-config libgoogle-perftools-dev

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
build-essential is already the newest version (12.9ubuntu3).
libgoogle-perftools-dev is already the newest version (2.9.1-0ubuntu3).
pkg-config is already the newest version (0.29.2-1ubuntu3).
cmake is already the newest version (3.22.1-1ubuntu1.22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 24 not upgraded.


In [None]:
import torch

In [None]:
import transformers as ppb

# Model class, tokenizer class, and name of the pretrained checkpoint
model_class, tokenizer_class, weights = (ppb.CamembertModel, ppb.CamembertTokenizer, 'camembert-base')

In [None]:
# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(weights)
model = model_class.from_pretrained(weights)


In [None]:
df_app = df[df['data_type'] == 'train']
df_test = df[df['data_type'] == 'val']

In [None]:
df_app.head()

id,text,category,label,data_type
614484565059596288,Dorian Gray with Rainbow Scarf #LoveWins (from...,happy,0,train
614746522043973632,@SelectShowcase @Tate_StIves ... Replace with ...,happy,0,train
614877582664835073,@Sofabsports thank you for following me back. ...,happy,0,train
611932373039644672,@britishmuseum @TudorHistory What a beautiful ...,happy,0,train
611570404268883969,@NationalGallery @ThePoldarkian I have always ...,happy,0,train


In [None]:
# BERT-style models such as CamemBERT accept at most 512 tokens per sequence.
# Here we only check for over-long tweets; none are found, so nothing has to be removed.

# Check whether any training text is longer than 512 tokens
max_len_app = 0
for i, sent in enumerate(df_app['text']):
    # Tokenize the text and add the special start/end tokens.
    input_ids_app = tokenizer.encode(sent, add_special_tokens=True)
    if len(input_ids_app) > 512:
        print("over-long text at", i, "with length",
              len(input_ids_app))
    # Update the maximum sequence length.
    max_len_app = max(max_len_app, len(input_ids_app))

print('Max sentence length: ', max_len_app)


# Same check for the validation texts
max_len_test = 0
for i, sent in enumerate(df_test['text']):
    # Tokenize the text and add the special start/end tokens.
    input_ids_test = tokenizer.encode(sent, add_special_tokens=True)
    if len(input_ids_test) > 512:
        print("over-long text at", i, "with length",
              len(input_ids_test))
    # Update the maximum sequence length.
    max_len_test = max(max_len_test, len(input_ids_test))

print('Max sentence length: ', max_len_test)

Max sentence length:  97
Max sentence length:  73


In [None]:
# Tokenize every training text (special tokens included)
tokenized_app = df_app['text'].apply(lambda x: tokenizer.encode(str(x), add_special_tokens=True))
max_len_app = max(len(ids) for ids in tokenized_app.values)

# Right-pad each sequence with 0 up to the longest one; the attention mask relies on this 0
padded_app = np.array([ids + [0]*(max_len_app-len(ids)) for ids in tokenized_app.values])
padded_app.shape

(1258, 97)

In [None]:
# Tokenize every validation text (special tokens included)
tokenized_test = df_test['text'].apply(lambda x: tokenizer.encode(str(x), add_special_tokens=True))
max_len_test = max(len(ids) for ids in tokenized_test.values)

# Right-pad each sequence with 0 up to the longest one; the attention mask relies on this 0
padded_test = np.array([ids + [0]*(max_len_test-len(ids)) for ids in tokenized_test.values])
padded_test.shape

(223, 73)

In [None]:
attention_mask_app = np.where(padded_app != 0, 1, 0)
attention_mask_app.shape

(1258, 97)

In [None]:
attention_mask_test = np.where(padded_test != 0, 1, 0)
attention_mask_test.shape

(223, 73)
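
The padding and masking cells above use 0 both as the filler value and as the test for "this position is padding". A slightly more defensive variant builds the mask from the original sequence lengths instead, so it stays correct whatever padding id is chosen (the tokenizer exposes its own as `tokenizer.pad_token_id`, which may differ from 0). The sketch below is plain Python with made-up token ids; `pad_and_mask` is a hypothetical helper, not part of the notebook.

```python
def pad_and_mask(sequences, pad_id=0):
    """Right-pad token id lists to a common length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # The mask comes from the original lengths, not from comparing with pad_id,
    # so a real token that shares the padding id is never masked out.
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask

# Made-up token ids, for illustration only
padded, mask = pad_and_mask([[5, 121, 87, 6], [5, 42, 6]])
print(padded)
print(mask)
```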

In [None]:
# Finally, convert the token ids to tensors so they can be passed through the transformer.
# Only the last hidden layer is kept for classification.

input_ids_app = torch.tensor(padded_app)
attention_mask_app = torch.tensor(attention_mask_app)

In [None]:
len(attention_mask_app)

1258

In [None]:
# Finally, convert the token ids to tensors so they can be passed through the transformer.
# Only the last hidden layer is kept for classification.

input_ids_test = torch.tensor(padded_test)
attention_mask_test = torch.tensor(attention_mask_test)

In [None]:
len(attention_mask_test)

223

In [None]:
with torch.no_grad():
    last_hidden_states_app = model(input_ids_app, attention_mask=attention_mask_app)

In [None]:
with torch.no_grad():
    last_hidden_states_test = model(input_ids_test, attention_mask=attention_mask_test)
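
Passing all rows through the model in a single call works for this dataset but can exhaust memory on larger ones. One common alternative is to process the rows in chunks. The helper below is a minimal sketch of the chunking logic only; `iter_batches` and the batch size of 256 are assumptions, and the commented lines indicate how it might be combined with the tensors above.

```python
def iter_batches(n_rows, batch_size):
    """Yield slice objects that cover range(n_rows) in chunks of batch_size."""
    for start in range(0, n_rows, batch_size):
        yield slice(start, min(start + batch_size, n_rows))

# 1258 is the number of training rows in this notebook
batches = list(iter_batches(1258, 256))
print(len(batches), batches[0], batches[-1])

# Possible use with the tensors above (not run here):
#   chunks = []
#   with torch.no_grad():
#       for rows in iter_batches(len(input_ids_app), 256):
#           out = model(input_ids_app[rows], attention_mask=attention_mask_app[rows])
#           chunks.append(out[0][:, 0, :])
#   features = torch.cat(chunks).numpy()
```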

## Task 5: Training a model

In [None]:
# Keep the last-layer vector of the first token of each validation text as its feature vector
features_valid = last_hidden_states_test[0][:,0,:].numpy()
labels_valid = df_test.label
labels_valid

id
613359710343929857    1
611947559444172801    0
612264160311803905    0
611844583224438784    0
615216447787270144    0
                     ..
614815258092421120    0
612216252686299136    0
611554358812090368    0
613813229735804928    0
610829951890120704    0
Name: label, Length: 223, dtype: int64

In [None]:
# Keep the last-layer vector of the first token of each training text as its feature vector
features = last_hidden_states_app[0][:,0,:].numpy()
labels = df_app.label
labels

id
614484565059596288    0
614746522043973632    0
614877582664835073    0
611932373039644672    0
611570404268883969    0
                     ..
611258135270060033    1
612214539468279808    0
613678555935973376    0
615246897670922240    0
613016084371914753    1
Name: label, Length: 1258, dtype: int64

In [None]:
train_features, test_features, train_labels, test_labels = train_test_split(
    features,
    labels,
    test_size = 0.2,
    # random_state=39444, # Uncomment for reproducible analyses/results
    stratify = labels
)


## Task 6: Document classification with multinomial logistic regression

In [None]:
from sklearn.linear_model import LogisticRegression
# Note: newer scikit-learn releases expect penalty=None instead of the string 'none'
model1 = LogisticRegression(random_state=0, multi_class='multinomial', penalty='none', solver='newton-cg').fit(train_features, train_labels)
preds = model1.predict(test_features)

# Print the model's parameters (no hyperparameter search was run in this example)
params = model1.get_params()
print(params)

{'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'multinomial', 'n_jobs': None, 'penalty': 'none', 'random_state': 0, 'solver': 'newton-cg', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}


In [None]:
# Model validation
from sklearn.metrics import accuracy_score
print('Accuracy: {:.2f}'.format(accuracy_score(test_labels, preds)))
print('Error rate: {:.2f}'.format(1 - accuracy_score(test_labels, preds)))

Accuracy: 0.81
Error rate: 0.19


## Task 7: Evaluation on the validation set

In [None]:
preds_valid = model1.predict(features_valid)

In [None]:
# Final predictions, mapped back from label ids to category names
final_preds = pd.DataFrame(preds_valid)
final_preds = final_preds.rename(columns={0: 'preds_Tag'})

label_dict_inverse = {}
for index, possible_label in enumerate(possible_labels):
    label_dict_inverse[index] = possible_label

label_dict_inverse

{0: 'happy',
 1: 'not-relevant',
 2: 'angry',
 3: 'disgust',
 4: 'sad',
 5: 'surprise'}

In [None]:
final_preds['preds_Tag'] = final_preds['preds_Tag'].map(label_dict_inverse)

final_preds

,preds_Tag
0,happy
1,happy
2,happy
3,happy
4,not-relevant
...,...
218,happy
219,happy
220,happy
221,happy


In [None]:
# Model validation
print('Accuracy: {:.2f}'.format(accuracy_score(labels_valid, preds_valid)))
print('Error rate: {:.2f}'.format(1 - accuracy_score(labels_valid, preds_valid)))

Accuracy: 0.79
Error rate: 0.21
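
An accuracy of 0.79 is easier to interpret next to a trivial baseline. Because the validation set is dominated by one class, always predicting that class already scores well. The sketch below computes that baseline from the validation class counts reported in this notebook's outputs; the dict is copied by hand and is not part of the original code.

```python
# Validation-set class counts, copied from the notebook's outputs
val_support = {
    "happy": 171, "not-relevant": 32, "angry": 9,
    "disgust": 1, "sad": 5, "surprise": 5,
}

# Accuracy of a classifier that always predicts the most frequent class
majority_class = max(val_support, key=val_support.get)
baseline_accuracy = val_support[majority_class] / sum(val_support.values())

print(f"Always predicting '{majority_class}': accuracy {baseline_accuracy:.2f}")
```

The logistic regression therefore improves on the majority-class baseline by only about two points of accuracy, which is why per-class metrics are worth inspecting as well.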


In [None]:
# Create classification report
from sklearn.metrics import classification_report
class_report=classification_report(labels_valid, preds_valid)
print(class_report)

              precision    recall  f1-score   support

           0       0.88      0.90      0.89       171
           1       0.54      0.44      0.48        32
           2       0.78      0.78      0.78         9
           3       0.00      0.00      0.00         1
           4       0.00      0.00      0.00         5
           5       0.25      0.20      0.22         5

    accuracy                           0.79       223
   macro avg       0.41      0.39      0.40       223
weighted avg       0.79      0.79      0.79       223
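
The gap between the macro average (0.40) and the weighted average (0.79) comes from how the two are computed: the macro average treats all six classes equally, while the weighted average scales each class by its support. The sketch below recomputes both F1 averages from the per-class values in the report. The inputs are rounded to two decimals, so the results are approximate.

```python
# Per-class F1 and support, copied from the classification report
f1 = {0: 0.89, 1: 0.48, 2: 0.78, 3: 0.00, 4: 0.00, 5: 0.22}
support = {0: 171, 1: 32, 2: 9, 3: 1, 4: 5, 5: 5}

# Macro average: plain mean over classes
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: mean over classes weighted by the number of true examples
weighted_f1 = sum(f1[c] * support[c] for c in f1) / sum(support.values())

print(f"macro F1 ~ {macro_f1:.3f}, weighted F1 ~ {weighted_f1:.3f}")
```

The two classes with an F1 of zero pull the macro average down sharply, while barely affecting the weighted one because they account for only 6 of the 223 rows.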





In [None]:
# Calculated probabilities
df_results = pd.DataFrame(model1.predict_proba(features_valid), columns=model1.classes_)
valid_values = df_test[['text']]
valid_tags = df_test[['category']]
#valid_documents = df_test[['id']]
valid_values.index = pd.RangeIndex(len(valid_values.index))
valid_tags.index = pd.RangeIndex(len(valid_tags.index))
#valid_documents.index = pd.RangeIndex(len(valid_documents.index))
df_results.index = pd.RangeIndex(len(df_results.index))

In [None]:
frames = [valid_values, valid_tags, final_preds, df_results.round(decimals = 6)]
result = pd.concat(frames, axis=1)

In [None]:
class_report=classification_report(valid_tags, final_preds)
print(class_report)

              precision    recall  f1-score   support

       angry       0.78      0.78      0.78         9
     disgust       0.00      0.00      0.00         1
       happy       0.88      0.90      0.89       171
not-relevant       0.54      0.44      0.48        32
         sad       0.00      0.00      0.00         5
    surprise       0.25      0.20      0.22         5

    accuracy                           0.79       223
   macro avg       0.41      0.39      0.40       223
weighted avg       0.79      0.79      0.79       223
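
The zero scores for `disgust` and `sad` reflect how few examples those classes have. One commonly used mitigation, which this notebook does not try, is to weight classes inversely to their frequency; scikit-learn's `class_weight='balanced'` option uses the formula n_samples / (n_classes * class_count). The sketch below evaluates that formula by hand on the category counts printed in Task 2, purely as an illustration of the magnitudes involved.

```python
# Class counts after filtering, copied from the value_counts() output in Task 2
counts = {
    "happy": 1137, "not-relevant": 214, "angry": 57,
    "surprise": 35, "sad": 32, "disgust": 6,
}

n_samples = sum(counts.values())
n_classes = len(counts)

# "Balanced" weights: n_samples / (n_classes * class_count)
weights = {c: n_samples / (n_classes * n) for c, n in counts.items()}

for c, w in weights.items():
    print(f"{c:>12}: {w:6.2f}")
```

Whether such weighting actually helps here is an open question; with a single `disgust` example in the validation set, any estimate for that class is extremely noisy.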





In [None]:
result = result.rename(columns=label_dict_inverse)
result

## Task 8: Testing Random Forest, SVM, XGBoost, LightGBM, and Stacking

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report


rf_model = RandomForestClassifier(random_state=39444)
rf_model.fit(train_features, train_labels)
rf_preds = rf_model.predict(test_features)

print('Random Forest Accuracy: {:.2f}'.format(accuracy_score(test_labels, rf_preds)))
print('Random Forest Error rate: {:.2f}'.format(1 - accuracy_score(test_labels, rf_preds)))
print('\nClassification Report:\n', classification_report(test_labels, rf_preds))

### SVM

In [None]:
from sklearn.svm import SVC

svm_model = SVC(random_state=39444)
svm_model.fit(train_features, train_labels)
svm_preds = svm_model.predict(test_features)

print('SVM Accuracy: {:.2f}'.format(accuracy_score(test_labels, svm_preds)))
print('SVM Error rate: {:.2f}'.format(1 - accuracy_score(test_labels, svm_preds)))
print('\nClassification Report:\n', classification_report(test_labels, svm_preds))

### XGBoost

In [None]:
!pip install xgboost

In [None]:
import xgboost as xgb

xgb_model = xgb.XGBClassifier(random_state=39444)
xgb_model.fit(train_features, train_labels)
xgb_preds = xgb_model.predict(test_features)

print('XGBoost Accuracy: {:.2f}'.format(accuracy_score(test_labels, xgb_preds)))
print('XGBoost Error rate: {:.2f}'.format(1 - accuracy_score(test_labels, xgb_preds)))
print('\nClassification Report:\n', classification_report(test_labels, xgb_preds))

### LightGBM

In [None]:
!pip install lightgbm

In [None]:
import lightgbm as lgb


lgb_model = lgb.LGBMClassifier(random_state=39444)
lgb_model.fit(train_features, train_labels)
lgb_preds = lgb_model.predict(test_features)

print('LightGBM Accuracy: {:.2f}'.format(accuracy_score(test_labels, lgb_preds)))
print('LightGBM Error rate: {:.2f}'.format(1 - accuracy_score(test_labels, lgb_preds)))
print('\nClassification Report:\n', classification_report(test_labels, lgb_preds))

### Stacking

In [None]:
from sklearn.ensemble import StackingClassifier


estimators = [
    ('rf', RandomForestClassifier(random_state=39444)),
    ('svm', SVC(random_state=39444)),
    ('xgb', xgb.XGBClassifier(random_state=39444)),
    ('lgb', lgb.LGBMClassifier(random_state=39444))
]

# Combine the base models with a logistic regression as the final estimator
stacking_model = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(random_state=17)
)

stacking_model.fit(train_features, train_labels)
stacking_preds = stacking_model.predict(test_features)

# Evaluate the performance
print('Stacking Accuracy: {:.2f}'.format(accuracy_score(test_labels, stacking_preds)))
print('Stacking Error rate: {:.2f}'.format(1 - accuracy_score(test_labels, stacking_preds)))
print('\nClassification Report:\n', classification_report(test_labels, stacking_preds))
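
Each model in this task repeats the same accuracy and error-rate lines. A small helper could factor that out and make side-by-side comparison easier. The sketch below is plain Python; `summarize` is a hypothetical helper and the labels are toy values, not results from this notebook.

```python
def summarize(name, y_true, y_pred):
    """Print accuracy and error rate in the notebook's format and return both."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    print('{} Accuracy: {:.2f}'.format(name, accuracy))
    print('{} Error rate: {:.2f}'.format(name, 1 - accuracy))
    return accuracy, 1 - accuracy

# Toy labels, for illustration only
acc, err = summarize('Toy model', [0, 1, 1, 0], [0, 1, 0, 0])
```

With the notebook's variables, a call such as `summarize('SVM', test_labels, svm_preds)` would be expected to print the same two lines as the corresponding cell.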