<a href="https://colab.research.google.com/github/vaibhavyesalwad/Sentiment-Analysis/blob/master/BERT_Text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Installing the transformers library (needed for BERT)**

In [1]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/3a/83/e74092e7f24a08d751aa59b37a9fc572b2e4af3918cb66f7766c3affb1b4/transformers-3.5.1-py3-none-any.whl (1.3MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting sentencepiece==0.1.91
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
Collecting tokenizers==0.9.3
  Downloading https://files.pythonhosted.org/packages/4c/34/b39eb9994bc3c999270b69c9eea40ecc6f0e97991dba28282b9fd32d44ee/tokenizers-0.9.3-cp36-cp36m-manylinux1_x86_64.whl (2.9MB)

**Importing necessary libraries**

In [2]:
import pandas as pd
import numpy as np
import spacy
import pickle
import tensorflow as tf
from tensorflow import keras  # use the Keras bundled with TensorFlow so it matches tf.keras
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, TFBertModel, BertConfig, TFBertForSequenceClassification


**Loading the dataset from Google Drive**

In [3]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [4]:
!ln -s /content/gdrive/My\ Drive/ /mydrive
!ls /mydrive/Airline-Sentiment-Analysis

airline_sentiment_analysis.csv	bert_label.pkl	bert_model.h5  without-pronoun
bert_inp.pkl			bert_mask.pkl	tb_bert


In [5]:
!cp /mydrive/Airline-Sentiment-Analysis/airline_sentiment_analysis.csv ./

In [6]:
data = pd.read_csv('airline_sentiment_analysis.csv')
data.head()

| Unnamed: 0.1 | Unnamed: 0 | airline_sentiment | text |
|---|---|---|---|
| 0 | 1 | positive | @VirginAmerica plus you've added commercials t... |
| 1 | 3 | negative | @VirginAmerica it's really aggressive to blast... |
| 2 | 4 | negative | @VirginAmerica and it's a really big bad thing... |
| 3 | 5 | negative | @VirginAmerica seriously would pay $30 a fligh... |
| 4 | 6 | positive | @VirginAmerica yes, nearly every time I fly VX... |


**Cleaning texts to create corpus**

In [7]:
import re

# 'en' is a spaCy 2.x shortcut link; the parser and NER are disabled because only lemmas are needed
nlp = spacy.load('en', disable=['parser', 'ner'])

def clean_text(text, use_pronoun_token=False, lower=True):
  """Lemmatise text with spaCy and keep only purely alphanumeric, non-numeric lemmas."""
  tokens = nlp(text.lower()) if lower else nlp(text)

  words = []
  for token in tokens:
    lemma = token.lemma_

    # spaCy 2.x lemmatises pronouns (you/me/he/she/his/him/they/them, etc.) as '-PRON-'
    if lemma == '-PRON-':
      # either substitute a generic PRONOUN token or keep the original pronoun as it is
      words.append("PRONOUN" if use_pronoun_token else str(token))

    # skip pure numbers and any lemma containing a character outside a-z and 0-9
    # (drops @mentions, punctuation and symbols; with lower=False, uppercase lemmas are dropped too)
    elif not (re.search("[^a-z0-9]", lemma) or lemma.isnumeric()):
      words.append(lemma)

  return " ".join(words)

In [8]:
data['corpus']= data['text'].apply(clean_text)

**Let's look at the cleaned text (corpus)**

In [9]:
for i in range(10):
  print("Original text:", data['text'][i])
  print("Cleaned text:", data['corpus'][i])
  print()

Original text: @VirginAmerica plus you've added commercials to the experience... tacky.
Cleaned text: plus you have add commercial to the experience tacky

Original text: @VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse
Cleaned text: it be really aggressive to blast obnoxious entertainment in your guest face amp they have little recourse

Original text: @VirginAmerica and it's a really big bad thing about it
Cleaned text: and it be a really big bad thing about it

Original text: @VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing.
it's really the only bad thing about flying VA
Cleaned text: seriously would pay a flight for seat that do not have this playing it be really the only bad thing about fly va

Original text: @VirginAmerica yes, nearly every time I fly VX this “ear worm” won’t go away :)
Cleaned text: yes nearly every time i fly vx this ear worm will not go away

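The filtering rule inside `clean_text` can be tried without spaCy. The sketch below applies the same regular expression to a hand-written list of lemmas; the helper name `keep_lemma` and the sample lemmas are assumptions made for illustration, not part of the notebook.

```python
import re

def keep_lemma(lemma):
    """Return True when the lemma has only a-z / 0-9 characters and is not a pure number."""
    return not (re.search("[^a-z0-9]", lemma) or lemma.isnumeric())

# Hand-written lemmas resembling what spaCy might produce for one of the tweets above
sample_lemmas = ["@virginamerica", "seriously", "would", "pay", "$", "30", "a", "flight", "..."]

kept = [lemma for lemma in sample_lemmas if keep_lemma(lemma)]
print(" ".join(kept))
```

The mention, the dollar sign, the number and the ellipsis are dropped, which is consistent with the cleaned texts shown above.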

In [10]:
data['label'] = data['airline_sentiment'].apply(lambda x:int(x=='positive'))

In [11]:
data.head()

| Unnamed: 0.1 | Unnamed: 0 | airline_sentiment | text | corpus | label |
|---|---|---|---|---|---|
| 0 | 1 | positive | @VirginAmerica plus you've added commercials t... | plus you have add commercial to the experience... | 1 |
| 1 | 3 | negative | @VirginAmerica it's really aggressive to blast... | it be really aggressive to blast obnoxious ent... | 0 |
| 2 | 4 | negative | @VirginAmerica and it's a really big bad thing... | and it be a really big bad thing about it | 0 |
| 3 | 5 | negative | @VirginAmerica seriously would pay $30 a fligh... | seriously would pay a flight for seat that do ... | 0 |
| 4 | 6 | positive | @VirginAmerica yes, nearly every time I fly VX... | yes nearly every time i fly vx this ear worm w... | 1 |


In [12]:
sentences=data['corpus']
labels=data['label']
classes = np.unique(data['airline_sentiment'].values)
num_classes = len(classes)
len(sentences),len(labels), num_classes


(11541, 11541, 2)
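The integer labels and the `classes` array line up because `np.unique` returns its values sorted, so `'negative'` lands at index 0 and `'positive'` at index 1. A minimal check on made-up sentiment values (the toy array is an assumption, not the real dataset):

```python
import numpy as np

# Toy sentiment column standing in for data['airline_sentiment']
sentiments = np.array(["positive", "negative", "negative", "positive"])

# Same rule as the label column: 1 for 'positive', 0 otherwise
toy_labels = (sentiments == "positive").astype(int)

# np.unique sorts, so index 0 is 'negative' and index 1 is 'positive'
toy_classes = np.unique(sentiments)

# Decoding an integer label through toy_classes gives back the original string
decoded = [str(toy_classes[i]) for i in toy_labels]
print(toy_classes, decoded)
```

This is what later allows a predicted class index to be turned back into a sentiment name.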

**Loading the pretrained BERT tokenizer for tokenizing texts and the BERT sequence classification model for fine-tuning**

In [13]:
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased',num_labels=num_classes)

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertForSequenceClassification: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['dropout_37', 'classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**Observing cleaned text (corpus) lengths and respective counts**

In [14]:
from collections import Counter
Counter(len(sent.split()) for sent in sentences)

Counter({1: 85,
         2: 154,
         3: 156,
         4: 172,
         5: 220,
         6: 229,
         7: 241,
         8: 258,
         9: 287,
         10: 321,
         11: 333,
         12: 366,
         13: 364,
         14: 433,
         15: 491,
         16: 495,
         17: 532,
         18: 592,
         19: 675,
         20: 749,
         21: 825,
         22: 761,
         23: 752,
         24: 681,
         25: 495,
         26: 390,
         27: 265,
         28: 121,
         29: 66,
         30: 19,
         31: 11,
         32: 2})
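One way to read a length table like this is to ask what share of texts fits under a candidate maximum length. The sketch below does that on a small made-up `Counter`; the helper name `coverage` and the toy counts are assumptions. It counts whitespace-separated words, whereas BERT counts word pieces plus two special tokens, so the real coverage at a given `max_length` can be somewhat lower.

```python
from collections import Counter

def coverage(length_counts, max_words):
    """Fraction of texts whose word count is at most max_words."""
    total = sum(length_counts.values())
    covered = sum(n for length, n in length_counts.items() if length <= max_words)
    return covered / total

# Toy distribution: {word count: number of texts}
toy_counts = Counter({5: 2, 10: 5, 20: 2, 40: 1})

for candidate in (16, 32, 64):
    print(candidate, coverage(toy_counts, candidate))
```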

**Preprocessing the data with the BERT tokenizer, as required by the BERT classification model**

In [15]:
def data_preprocessing(data, max_length=64):
  """Transform cleaned text into padded input ids and attention masks (NumPy arrays)."""

  sentences = data['corpus']
  input_ids = []
  attention_masks = []

  for sent in sentences:
    # pad every sequence to max_length and truncate longer ones explicitly;
    # padding='max_length' replaces the deprecated pad_to_max_length=True
    bert_inp = bert_tokenizer.encode_plus(
        sent,
        add_special_tokens=True,
        max_length=max_length,
        padding='max_length',
        truncation=True,
        return_attention_mask=True)
    input_ids.append(bert_inp['input_ids'])
    attention_masks.append(bert_inp['attention_mask'])

  return np.asarray(input_ids), np.asarray(attention_masks)

In [16]:
# a maximum sequence length of 32 should cover most of the cleaned texts (see the length counts above); longer ones are truncated
input_ids, attention_masks = data_preprocessing(data, max_length=32)
labels = np.array(labels)
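To make the two returned arrays concrete, here is a simplified, pure-Python picture of padding and attention masks. It is not the tokenizer's implementation: the helper `pad_and_mask` is hypothetical, the token ids are illustrative, and the real tokenizer handles special tokens more carefully when it truncates.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad token_ids to max_length and build the matching attention mask."""
    ids = list(token_ids[:max_length])      # cut off anything beyond max_length
    mask = [1] * len(ids)                   # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad   # 0 marks padding

# Illustrative ids: 101 and 102 stand in for [CLS] and [SEP], 11 and 12 for two word pieces
ids, mask = pad_and_mask([101, 11, 12, 102], max_length=6)
print(ids)
print(mask)
```

The attention mask tells the model which positions hold real tokens and which are padding that should be ignored.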


**Pickling input ids and attention masks so they can be reused in later sessions**

In [None]:
print('Preparing the pickle file.....')

pickle_inp_path = '/mydrive/Airline-Sentiment-Analysis/bert_inp.pkl'
pickle_mask_path = '/mydrive/Airline-Sentiment-Analysis/bert_mask.pkl'
pickle_label_path = '/mydrive/Airline-Sentiment-Analysis/bert_label.pkl'

# write each array to its own file; the with-block closes the file even if dump fails
for path, array in [(pickle_inp_path, input_ids),
                    (pickle_mask_path, attention_masks),
                    (pickle_label_path, labels)]:
  with open(path, 'wb') as f:
    pickle.dump(array, f)

print('Pickle files saved as ', pickle_inp_path, pickle_mask_path, pickle_label_path)



Preparing the pickle file.....
Pickle files saved as  /mydrive/Airline-Sentiment-Analysis/bert_inp.pkl /mydrive/Airline-Sentiment-Analysis/bert_mask.pkl /mydrive/Airline-Sentiment-Analysis/bert_label.pkl


In [None]:
print('Loading the saved pickle files..')

# the with-blocks make sure each file handle is closed after loading
with open(pickle_inp_path, 'rb') as f:
  input_ids = pickle.load(f)
with open(pickle_mask_path, 'rb') as f:
  attention_masks = pickle.load(f)
with open(pickle_label_path, 'rb') as f:
  labels = pickle.load(f)

print('Input shape {} Attention mask shape {} Input label shape {}'.format(input_ids.shape, attention_masks.shape, labels.shape))


Loading the saved pickle files..
Input shape (11541, 32) Attention mask shape (11541, 32) Input label shape (11541,)
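As an alternative to three separate pickle files, NumPy's `np.savez` can store several arrays in a single `.npz` archive. This is a different technique from the one the notebook uses; the file name `bert_features.npz`, the temporary directory and the toy arrays below are all assumptions for illustration.

```python
import os
import tempfile

import numpy as np

# Toy arrays standing in for input_ids, attention_masks and labels
toy_ids = np.arange(6).reshape(2, 3)
toy_masks = np.ones((2, 3), dtype=int)
toy_labels = np.array([0, 1])

# Hypothetical location; in the notebook this would be a path on the mounted drive
path = os.path.join(tempfile.mkdtemp(), "bert_features.npz")
np.savez(path, input_ids=toy_ids, attention_masks=toy_masks, labels=toy_labels)

# Load the archive back and pull out each array by the keyword it was saved under
with np.load(path) as saved:
    restored_ids = saved["input_ids"]
    restored_masks = saved["attention_masks"]
    restored_labels = saved["labels"]

print(restored_ids.shape, restored_masks.shape, restored_labels.shape)
```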


**Splitting features and labels into training and validation sets**

In [None]:
train_inp, val_inp, train_label, val_label, train_mask, val_mask = train_test_split(
    input_ids, labels, attention_masks, test_size=0.2)

print('Train inp shape {} Val input shape {}\n'
      'Train label shape {} Val label shape {}\n'
      'Train attention mask shape {} Val attention mask shape {}'.format(
          train_inp.shape, val_inp.shape, train_label.shape, val_label.shape, train_mask.shape, val_mask.shape))

Train inp shape (9232, 32) Val input shape (2309, 32)
Train label shape (9232,) Val label shape (2309,)
Train attention mask shape (9232, 32) Val attention mask shape (2309, 32)
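`train_test_split` accepts several arrays at once and splits them with the same row permutation, which is why ids, labels and masks stay aligned. The sketch below checks this on toy arrays and also shows two optional arguments the notebook does not use: `random_state` for a repeatable split and `stratify` to keep the class ratio equal in both parts. The toy data and the seed value are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for input_ids, labels and attention_masks (20 rows each)
toy_ids = np.repeat(np.arange(20)[:, None], 4, axis=1)   # row i is filled with the value i
toy_masks = toy_ids * 10                                 # row i is filled with 10 * i
toy_labels = np.array([0] * 16 + [1] * 4)                # 80% negative, 20% positive

tr_inp, va_inp, tr_label, va_label, tr_mask, va_mask = train_test_split(
    toy_ids, toy_labels, toy_masks,
    test_size=0.25,
    random_state=42,        # assumption: any fixed seed makes the split repeatable
    stratify=toy_labels)    # keep the class ratio the same in both splits

print(tr_inp.shape, va_inp.shape)
print("positives in train/val:", tr_label.sum(), va_label.sum())
```

With an imbalanced dataset such as this one, stratifying can make validation metrics for the smaller class less dependent on the luck of the split.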


**Defining hyperparameters for training the BERT model and saving model weights to Google Drive for future use**

In [17]:
log_dir = '/mydrive/Airline-Sentiment-Analysis/tb_bert'
model_save_path = '/mydrive/Airline-Sentiment-Analysis/bert_model.h5'

# keep only the weights with the lowest validation loss, and log to TensorBoard
callbacks = [
    tf.keras.callbacks.ModelCheckpoint(filepath=model_save_path, save_weights_only=True,
                                       monitor='val_loss', mode='min', save_best_only=True),
    tf.keras.callbacks.TensorBoard(log_dir=log_dir),
]

# summary() prints the table itself and returns None, so it is not wrapped in print()
bert_model.summary()

# the model returns raw logits, hence from_logits=True
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5, epsilon=1e-08)

bert_model.compile(loss=loss, optimizer=optimizer, metrics=[metric])



Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
bert (TFBertMainLayer)       multiple                  109482240 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  1538      
=================================================================
Total params: 109,483,778
Trainable params: 109,483,778
Non-trainable params: 0
_________________________________________________________________
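The loss chosen here takes integer class labels and raw logits. A NumPy sketch of that computation is given below: a log-softmax over each row, then the negative log-probability of the true class, averaged over the batch. The function name and the example logits are assumptions, and averaging over the batch is assumed to match the Keras default reduction.

```python
import numpy as np

def sparse_ce_from_logits(logits, labels):
    """Mean cross-entropy between integer labels and unnormalised logits."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    # subtract the row maximum for numerical stability before exponentiating
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # pick the log-probability of the true class in every row
    return -log_probs[np.arange(len(labels)), labels].mean()

# Made-up logits for two examples with two classes (negative, positive)
example_logits = [[2.0, 0.0], [0.0, 2.0]]
example_labels = [0, 1]
print(sparse_ce_from_logits(example_logits, example_labels))
```

Because the softmax is applied inside the loss, the classifier head can stay a plain dense layer with no activation.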


**Training/fine-tuning the BERT model**

In [None]:
history=bert_model.fit([train_inp,train_mask],train_label,batch_size=32,epochs=5,validation_data=([val_inp,val_mask],val_label),callbacks=callbacks)

Epoch 1/5
Instructions for updating:
use `tf.profiler.experimental.stop` instead.
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


**Loading weights of trained model**

In [18]:
bert_model.load_weights(model_save_path)

**Prediction on our custom input texts using fine-tuned model**

In [19]:
def predict_sentiment(texts):
  """Predict 'negative' or 'positive' for each raw text in `texts`."""
  data = pd.DataFrame({'text': texts})
  data['corpus'] = data['text'].apply(clean_text)

  # use the same max_length as during training
  input_ids, attention_masks = data_preprocessing(data, max_length=32)
  results = bert_model.predict([input_ids, attention_masks])
  # results[0] holds the logits; argmax gives the class index, which is mapped back to its name
  results = np.argmax(results[0], axis=1)
  return [classes[label] for label in results]
  

In [20]:
sentences = ["shut up", "yeah fucking amazing", "not good", "I'm gonna sue you", "it was classy", "will travel again"]

In [21]:
predict_sentiment(texts=sentences)



['negative', 'positive', 'negative', 'negative', 'positive', 'negative']
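The step from logits to class names can be reproduced on made-up numbers. The sketch below also applies a softmax to get a confidence value, which `predict_sentiment` does not return; the logits and variable names are assumptions for illustration.

```python
import numpy as np

class_names = np.array(["negative", "positive"])   # same order as np.unique gives

# Made-up logits for two texts: columns are (negative, positive)
toy_logits = np.array([[2.0, -1.0],
                       [-0.5, 1.5]])

# Softmax per row, with the row maximum subtracted for numerical stability
shifted = toy_logits - toy_logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

pred_idx = probs.argmax(axis=1)                     # same result as argmax over the logits
pred_names = [str(class_names[i]) for i in pred_idx]
confidences = probs[np.arange(len(pred_idx)), pred_idx]

for name, conf in zip(pred_names, confidences):
    print(name, round(float(conf), 3))
```

A confidence value can help flag borderline predictions for a closer look.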

**Brief evaluation using the metrics in scikit-learn's classification report**

In [None]:
from sklearn.metrics import classification_report

In [None]:
results = bert_model.predict([train_inp, train_mask])
train_inferences = np.argmax(results[0], axis=1)
print("Train set".center(50))
print(classification_report(train_label, train_inferences, target_names=classes))

                    Train set                     
              precision    recall  f1-score   support

    negative       0.97      0.98      0.98      7339
    positive       0.94      0.90      0.92      1893

    accuracy                           0.97      9232
   macro avg       0.95      0.94      0.95      9232
weighted avg       0.97      0.97      0.97      9232



In [None]:
results = bert_model.predict([val_inp, val_mask])
val_inferences = np.argmax(results[0], axis=1)
print("Validation set".center(50))
print(classification_report(val_label, val_inferences, target_names=classes))

                  Validation set                  
              precision    recall  f1-score   support

    negative       0.96      0.97      0.96      1839
    positive       0.88      0.82      0.85       470

    accuracy                           0.94      2309
   macro avg       0.92      0.90      0.91      2309
weighted avg       0.94      0.94      0.94      2309
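The per-class numbers in these reports come from counts of true positives, false positives and false negatives. A small NumPy sketch on toy labels shows the arithmetic; the helper name and the toy arrays are assumptions, and the helper does not guard against division by zero.

```python
import numpy as np

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class, computed from raw counts."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))   # correctly predicted positives
    fp = np.sum((y_pred == positive) & (y_true != positive))   # predicted positive, actually not
    fn = np.sum((y_pred != positive) & (y_true == positive))   # positives that were missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return float(precision), float(recall), float(f1)

# Toy data: 4 positive and 6 negative examples
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(precision_recall_f1(toy_true, toy_pred))
```

In the reports above the positive class has far fewer examples than the negative class (470 versus 1839 in the validation set), and its recall is the weakest number, so per-class metrics are more informative here than accuracy alone.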

