<a href="https://colab.research.google.com/github/singularity014/BERT_sentiment_analysis_quick/blob/master/BERT_tokenizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# I - Import Dependencies

In [5]:
import numpy as np
import math
import re
import pandas as pd
from bs4 import BeautifulSoup
import random

from google.colab import drive

In [6]:
!pip install bert-for-tf2
!pip install sentencepiece

Collecting bert-for-tf2
  Downloading https://files.pythonhosted.org/packages/35/5c/6439134ecd17b33fe0396fb0b7d6ce3c5a120c42a4516ba0e9a2d6e43b25/bert-for-tf2-0.14.4.tar.gz (40kB)
Collecting py-params>=0.9.6
  Downloading https://files.pythonhosted.org/packages/a4/bf/c1c70d5315a8677310ea10a41cfc41c5970d9b37c31f9c90d4ab98021fd1/py-params-0.9.7.tar.gz
Collecting params-flow>=0.8.0
  Downloading https://files.pythonhosted.org/packages/a9/95/ff49f5ebd501f142a6f0aaf42bcfd1c192dc54909d1d9eb84ab031d46056/params-flow-0.8.2.tar.gz
Building wheels for collected packages: bert-for-tf2, py-params, params-flow
  Building wheel for bert-for-tf2 (setup.py) ... done
  Created wheel for bert-for-tf2: filename=bert_for_tf2

In [8]:
# setting tf version to 2.x
try:
    %tensorflow_version 2.x
except Exception:
    pass

import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras import layers
import bert
tf.__version__

'2.2.0'

# II - Data Preprocessing

### Loading files


---
The link to the data is provided in the README file of the GitHub repository.



In [10]:
drive.mount("/content/drive")

Mounted at /content/drive


In [11]:
# data cols
cols = ["sentiment", "id", "date", "query", "user", "text"]
# loading via google drive path
data = pd.read_csv(
    "/content/drive/My Drive/sentiment_data/train_data.csv",
    header=None,
    names=cols,
    engine="python",
    encoding="latin1"
)

data.drop(["id", "date", "query", "user"],
          axis=1,
          inplace=True)

In [8]:
data.head()

|   | sentiment | text |
|---|---|---|
| 0 | 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | is upset that he can't update his Facebook by ... |
| 2 | 0 | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | my whole body feels itchy and like its on fire |
| 4 | 0 | @nationwideclass no, it's not behaving at all.... |


### Data Cleaning

In [13]:
def clean_tweet(tweet):
    # Stripping HTML markup and decoding HTML entities
    tweet = BeautifulSoup(tweet, "lxml").get_text()
    # Removing @mentions
    tweet = re.sub(r"@[A-Za-z0-9]+", ' ', tweet)
    # Removing URLs
    tweet = re.sub(r"https?://[A-Za-z0-9./]+", ' ', tweet)
    # Keeping only letters and the punctuation . ! ? '
    tweet = re.sub(r"[^a-zA-Z.!?']", ' ', tweet)
    # Collapsing repeated spaces into one
    tweet = re.sub(r" +", ' ', tweet)
    return tweet



In [14]:
# clean the data
data_clean = [clean_tweet(tweet) for tweet in data.text]
# checking first few cleaned data
data_clean[:5]

[" Awww that's a bummer. You shoulda got David Carr of Third Day to do it. D",
 "is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!",
 ' I dived many times for the ball. Managed to save The rest go out of bounds',
 'my whole body feels itchy and like its on fire ',
 " no it's not behaving at all. i'm mad. why am i here? because I can't see you all over there. "]
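To make the effect of each cleaning step easier to follow, here is a minimal regex-only sketch of the same pipeline. It is an approximation, not the notebook's function: `html.unescape` stands in for `BeautifulSoup(...).get_text()` (so it decodes entities but does not strip real tags), and the name `clean_tweet_sketch` and the sample tweet are made up for illustration.

```python
import html
import re


def clean_tweet_sketch(tweet: str) -> str:
    """Regex-only approximation of the cleaning steps (assumes no real HTML tags)."""
    # Stand-in for BeautifulSoup(...).get_text(): only decodes entities such as &lt;
    tweet = html.unescape(tweet)
    # Replace @mentions (letters and digits only, same pattern as the notebook)
    tweet = re.sub(r"@[A-Za-z0-9]+", " ", tweet)
    # Replace URLs
    tweet = re.sub(r"https?://[A-Za-z0-9./]+", " ", tweet)
    # Keep letters and the punctuation . ! ? ' ; everything else becomes a space
    tweet = re.sub(r"[^a-zA-Z.!?']", " ", tweet)
    # Collapse runs of spaces into a single space
    return re.sub(r" +", " ", tweet)


# Hypothetical sample tweet
sample = "@someuser I &lt;3 this http://example.com/x1 so much!!! 10/10"
cleaned = clean_tweet_sketch(sample)
print(repr(cleaned))  # ' I this so much!!! '
```

As in the notebook's output, a leading or trailing space can remain where a mention or number was removed; calling `.strip()` on the result would drop it if that matters downstream.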

In [15]:
# changing label 4 to 1, because positive labels
# are encoded as 4 in the dataset (0 is negative), hence converting them to 1
# for the conventional 0/1 labels
data_labels = data.sentiment.values
data_labels[data_labels == 4] = 1

### BERT Tokenization

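We will use BERT-style (WordPiece) tokenization.
To do so, we load a BERT layer from TensorFlow Hub and read its vocabulary file and lowercasing setting, then build the tokenizer from them.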

In [16]:
FullTokenizer = bert.bert_tokenization.FullTokenizer

bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
                            trainable=False)
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()

tokenizer = FullTokenizer(vocab_file, do_lower_case)

In [17]:
# BERT-style tokenization
# Note that convert_tokens_to_ids plays a role similar
# to texts_to_sequences of the tf.keras Tokenizer utility
def encode_sentence(sent):
    return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sent))
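To give a feel for what `tokenize` followed by `convert_tokens_to_ids` does, here is a toy sketch of greedy longest-match-first WordPiece splitting. The vocabulary, the function names and the sample text are all hypothetical; the real tokenizer reads its vocabulary from `vocab_file` and also splits punctuation, which this sketch skips.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (hypothetical; the real one comes from vocab_file)
toy_vocab = {"[UNK]": 0, "play": 1, "##ing": 2, "##ed": 3, "fun": 4}


def encode_words(text):
    """Lowercase, split on whitespace, apply WordPiece, then look up ids."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece_tokenize(word, toy_vocab))
    return tokens, [toy_vocab[t] for t in tokens]


tokens, ids = encode_words("Playing played fun xyz")
print(tokens)  # ['play', '##ing', 'play', '##ed', 'fun', '[UNK]']
print(ids)     # [1, 2, 1, 3, 4, 0]
```

Note that `encode_sentence` above only tokenizes and looks up ids; it does not add the `[CLS]` and `[SEP]` special tokens.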

In [18]:
data_inputs = [encode_sentence(sentence) for sentence in data_clean]
data[0:2]

|   | sentiment | text |
|---|---|---|
| 0 | 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | is upset that he can't update his Facebook by ... |


### Data Creation

Minimized padding
*   We pad the data per batch, to the length of the longest sentence in that batch, rather than to one global length.
*   We create the batches so that each one holds sentences of similar length.
*   With this strategy fewer padding tokens are needed, so less computation is wasted on padding.
*   We shuffle the data so that examples with the same label do not often end up in the same batch.
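The saving from length-sorted batches can be estimated with a small count of pad tokens. The sketch below is only an illustration: the sentence lengths and the helper `padding_needed` are hypothetical, and it compares padding to one global length, per-batch padding in the original order, and per-batch padding after sorting by length.

```python
def padding_needed(lengths, batch_size):
    """Count pad tokens when each batch is padded to its own longest sequence."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total


# Hypothetical token counts for eight sentences
lengths = [8, 30, 9, 28, 10, 31, 8, 29]

# Every sentence padded to the single longest sentence
global_pad = sum(max(lengths) - n for n in lengths)
# Per-batch padding, batches taken in the original order
unsorted_pad = padding_needed(lengths, batch_size=4)
# Per-batch padding, batches taken after sorting by length
sorted_pad = padding_needed(sorted(lengths), batch_size=4)

print(global_pad, unsorted_pad, sorted_pad)  # 95 91 11
```

With these made-up lengths, grouping similar lengths cuts the pad tokens from 95 to 11; the actual saving depends on the length distribution of the data.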

In [20]:
# data_with_len is a list of entries of the form
# [<sentence ids>, <label>, <sentence length>]

data_with_len = [[sent, data_labels[i], len(sent)]
                 for i, sent in enumerate(data_inputs)]

# shuffle first so that sentences of equal length end up in random order
random.shuffle(data_with_len)
# sort by sentence length (the sort is stable, so ties keep their shuffled order)
data_with_len.sort(key=lambda x: x[2])
# keep only the (sentence, label) pairs whose sentence has more than seven tokens,
# to get rid of very short sentences
sorted_all = [(sent_lab[0], sent_lab[1])
              for sent_lab in data_with_len if sent_lab[2] > 7]
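The same shuffle, sort and filter sequence can be traced on a handful of made-up examples. In this sketch the token ids, labels and the fixed seed are assumptions for illustration only; the point is that Python's `list.sort` is stable, so entries of equal length keep the random order produced by the shuffle.

```python
import random

random.seed(0)  # assumption: fixed seed only to make the sketch repeatable

# Hypothetical (token_ids, label) pairs with lengths 9, 3, 9, 12, 8, 7, 8
encoded = [([1] * n, label) for n, label in
           [(9, 0), (3, 1), (9, 1), (12, 0), (8, 1), (7, 0), (8, 0)]]

with_len = [[ids, label, len(ids)] for ids, label in encoded]
random.shuffle(with_len)                 # randomize order within equal lengths
with_len.sort(key=lambda row: row[2])    # stable sort by length
kept = [(ids, label) for ids, label, n in with_len if n > 7]

kept_lengths = [len(ids) for ids, _ in kept]
print(kept_lengths)  # [8, 8, 9, 9, 12]
```

The sentences with 3 and 7 tokens are dropped by the `> 7` filter, and the rest come out in ascending length.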

In [24]:
sorted_all[0:5]
# the list is now sorted by sentence length, so the first entries are the shortest kept sentences (8 tokens)


[([2003, 4394, 5003, 3336, 999, 6289, 23644, 999], 0),
 ([2009, 2052, 2022, 1039, 2074, 2005, 4569, 1012], 1),
 ([2183, 2188, 3666, 9340, 26668, 2006, 1996, 3902], 1),
 ([13763, 7034, 7034, 999, 2175, 2156, 999, 999], 1),
 ([2173, 1045, 1005, 1049, 3374, 2115, 9850, 3062], 0)]