## Product Sentiment Data - Imbalance

Data (public domain): https://data.world/crowdflower/brands-and-product-emotions

Notebook code based on IMDB notebook from bert-sklearn/other_examples

In [11]:
import numpy as np
import pandas as pd
import os
import sys
import csv
import re
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.utils import shuffle
from ftfy import fix_text
 
from bert_sklearn import BertClassifier
from bert_sklearn import load_model

print(os.getcwd())

DATAFILE = "./data/judge-expanded.csv"

/Users/joseph.porter/Data/nas2019/NAS2019


In [13]:
# Load Data

    
data = pd.read_csv(DATAFILE)
print(len(data))
data = data[data['text'].notnull()]
print(len(data))
data.head(10)

11493
11492


Unnamed: 0,text,company,label
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,Apple,-1
1,@jessedee Know about @fludapp ? Awesome iPad/i...,Apple,1
2,@swonderlin Can not wait for #iPad 2 also. The...,Apple,1
3,@sxsw I hope this year's festival isn't as cra...,Apple,-1
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,1
5,@teachntech00 New iPad Apps For #SpeechTherapy...,,0
7,"#SXSW is just starting, #CTIA is around the co...",Google,1
8,Beautifully smart and simple idea RT @madebyma...,Apple,1
9,Counting down the days to #sxsw plus strong Ca...,Apple,1
10,Excited to meet the @samsungmobileus at #sxsw ...,Google,1


In [14]:
# Split into training and test data

msk = np.random.rand(len(data)) < 0.8
train = data[msk]
test = data[~msk]
print('Training data size: ' + str(train.shape))
print('Test data size: ' + str(test.shape))

Training data size: (9281, 3)
Test data size: (2211, 3)
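
The random mask above is unseeded, so the split sizes and class mix change on every run. A minimal sketch of a reproducible, stratified alternative using scikit-learn's `train_test_split` follows. The toy frame and the seed value are illustrative assumptions, not the notebook's data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the notebook's `data` frame (hypothetical rows).
toy = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(100)],
    "label": [-1] * 25 + [0] * 50 + [1] * 25,
})

# stratify keeps the label proportions the same in both parts;
# random_state makes the split repeatable.
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, stratify=toy["label"], random_state=42
)

train_counts = toy_train["label"].value_counts().sort_index().to_dict()
test_counts = toy_test["label"].value_counts().sort_index().to_dict()
print(train_counts)
print(test_counts)
```

With the real data, the same call would take `data` and `data['label']` in place of the toy frame.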


In [15]:
from collections import Counter

def print_dist(dataset, label='label'):
    
    dist = Counter(dataset[label])
    total = len(dataset)
    for k,v in sorted(dist.items(), key=lambda x: x[0]):
        pct = 100.0 * (float(v)/float(total))
        print(f'{k}: {v} ({pct:5.2f}%)')
    

In [16]:
print('Train dist:')
print_dist(train)
print('Test dist:')
print_dist(test)

Train dist:
-1: 2386 (25.71%)
0: 4506 (48.55%)
1: 2389 (25.74%)
Test dist:
-1: 584 (26.41%)
0: 1038 (46.95%)
1: 589 (26.64%)


In [17]:
train[:1].values

array([["@jessedee Know about @fludapp ? Awesome iPad/iPhone app that you'll likely appreciate for its design. Also, they're giving free Ts at #SXSW",
        'Apple', 1]], dtype=object)

As you can see, each example is a tweet of only a sentence or two, far shorter than an IMDB review. The Google AI BERT models were trained on sequences of maximum length 512. Let's look at the performance for max_seq_length equal to 128, 256, and 512.
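
To judge which `max_seq_length` is actually needed, it helps to look at how long the texts are. The sketch below uses whitespace-separated word counts as a rough proxy. BERT's WordPiece tokenizer usually produces more tokens than words, so treat these numbers as a lower bound. The first string is the tweet shown above; the other two are hypothetical.

```python
import numpy as np

texts = [
    # Tweet printed by train[:1].values above.
    "@jessedee Know about @fludapp ? Awesome iPad/iPhone app that you'll "
    "likely appreciate for its design. Also, they're giving free Ts at #SXSW",
    # Hypothetical short examples for illustration only.
    "Counting down the days to #sxsw",
    "Excited to meet the @samsungmobileus at #sxsw",
]

# Word count per text: a crude lower bound on the WordPiece token count.
word_counts = np.array([len(t.split()) for t in texts])

print("max words:", word_counts.max())
print("95th percentile:", np.percentile(word_counts, 95))
```

On the full dataset, `data['text'].str.split().str.len()` would give the same word counts per row.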

### max_seq_length = 128

In [18]:
## Set up data for the classifier

train = train.sample(800)
test = test.sample(500)

print("Train data size: %d "%(len(train)))
print("Test data size: %d "%(len(test)))

X_train = train['text']
y_train = train['label']

X_test = test['text']
y_test = test['label']

Train data size: 800 
Test data size: 500 


In [19]:
print('Train dist:')
print_dist(train)
print('Test dist:')
print_dist(test)

Train dist:
-1: 211 (26.38%)
0: 400 (50.00%)
1: 189 (23.62%)
Test dist:
-1: 141 (28.20%)
0: 227 (45.40%)
1: 132 (26.40%)


In [20]:
## Create the model

model = BertClassifier(bert_model='bert-base-uncased', label_list=[-1,0,1])
model.max_seq_length = 128
model.learning_rate = 2e-05
model.epochs = 4

print(model)


Building sklearn text classifier...
BertClassifier(bert_config_json=None, bert_model='bert-base-uncased',
               bert_vocab=None, do_lower_case=None, epochs=4, eval_batch_size=8,
               fp16=False, from_tf=False, gradient_accumulation_steps=1,
               ignore_label=None, label_list=[-1, 0, 1], learning_rate=2e-05,
               local_rank=-1, logfile='bert_sklearn.log', loss_scale=0,
               max_seq_length=128, num_mlp_hiddens=500, num_mlp_layers=0,
               random_state=42, restore_file=None, train_batch_size=32,
               use_cuda=True, validation_fraction=0.1, warmup_proportion=0.1)


In [21]:
%%time
## Train the model using our data (this could take a while)

model.fit(X_train, y_train)

accy = model.score(X_test, y_test)

Loading bert-base-uncased model...
Defaulting to linear classifier/regressor
Loading Pytorch checkpoint
train data size: 720, validation data size: 80


Training  : 100%|██████████| 23/23 [09:21<00:00, 24.39s/it, loss=0.964]
Validating: 100%|██████████| 10/10 [00:15<00:00,  1.58s/it]

Epoch 1, Train loss: 0.9637, Val loss: 1.2498, Val accy: 31.25%



Training  : 100%|██████████| 23/23 [08:53<00:00, 23.18s/it, loss=0.77] 
Validating: 100%|██████████| 10/10 [00:15<00:00,  1.56s/it]

Epoch 2, Train loss: 0.7700, Val loss: 0.6694, Val accy: 70.00%



Training  : 100%|██████████| 23/23 [08:45<00:00, 22.86s/it, loss=0.674]
Validating: 100%|██████████| 10/10 [00:16<00:00,  1.61s/it]

Epoch 3, Train loss: 0.6739, Val loss: 0.6430, Val accy: 70.00%



Training  : 100%|██████████| 23/23 [08:52<00:00, 23.14s/it, loss=0.651]
Validating: 100%|██████████| 10/10 [00:16<00:00,  1.66s/it]

Epoch 4, Train loss: 0.6507, Val loss: 0.6354, Val accy: 70.00%



Testing: 100%|██████████| 63/63 [01:48<00:00,  1.73s/it]


Loss: 0.6386, Accuracy: 69.20%
CPU times: user 1h 39min 57s, sys: 8min 35s, total: 1h 48min 33s
Wall time: 38min 47s
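
The progress-bar totals in the log above (23 training steps, 10 validation steps, 63 test steps) follow from the sample sizes and batch sizes shown in the model repr. A quick arithmetic check, assuming the last partial batch is kept rather than dropped:

```python
import math

n_samples = 800            # size of the sampled training frame
validation_fraction = 0.1  # value shown in the BertClassifier repr
train_batch_size = 32
eval_batch_size = 8
n_test = 500

# Hold out the validation share, train on the rest.
n_val = round(n_samples * validation_fraction)
n_train = n_samples - n_val

# Number of batches per pass, counting a final partial batch.
train_steps = math.ceil(n_train / train_batch_size)
val_steps = math.ceil(n_val / eval_batch_size)
test_steps = math.ceil(n_test / eval_batch_size)

print(n_train, n_val, train_steps, val_steps, test_steps)
```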





In [32]:
model.fit(X_train, y_train, load_at_start=False)
accy2 = model.score(X_test, y_test)

train data size: 720, validation data size: 80


Training  : 100%|██████████| 23/23 [08:46<00:00, 22.90s/it, loss=0.684]
Validating: 100%|██████████| 10/10 [00:15<00:00,  1.55s/it]

Epoch 1, Train loss: 0.6838, Val loss: 0.5731, Val accy: 78.75%



Training  : 100%|██████████| 23/23 [08:51<00:00, 23.12s/it, loss=0.654]
Validating: 100%|██████████| 10/10 [00:15<00:00,  1.57s/it]

Epoch 2, Train loss: 0.6539, Val loss: 0.4888, Val accy: 82.50%



Training  : 100%|██████████| 23/23 [08:44<00:00, 22.80s/it, loss=0.618]
Validating: 100%|██████████| 10/10 [00:15<00:00,  1.54s/it]

Epoch 3, Train loss: 0.6181, Val loss: 0.5399, Val accy: 83.75%



Training  : 100%|██████████| 23/23 [08:43<00:00, 22.76s/it, loss=0.573]
Validating: 100%|██████████| 10/10 [00:15<00:00,  1.54s/it]

Epoch 4, Train loss: 0.5730, Val loss: 0.4612, Val accy: 78.75%



Testing: 100%|██████████| 63/63 [01:35<00:00,  1.52s/it]


Loss: 0.6648, Accuracy: 68.60%





In [33]:
y_pred = model.predict(X_test)

Predicting: 100%|██████████| 63/63 [01:48<00:00,  1.73s/it]


In [38]:
report = classification_report(y_test, y_pred, labels=[-1,0,1])
print(report)

              precision    recall  f1-score   support

          -1       0.95      0.85      0.90       141
           0       0.60      0.97      0.74       227
           1       0.33      0.02      0.04       132

    accuracy                           0.69       500
   macro avg       0.63      0.61      0.56       500
weighted avg       0.63      0.69      0.60       500
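
The report shows recall of 0.02 for class 1 but does not say which class those examples were assigned to. A confusion matrix answers that directly. The sketch below uses small hypothetical label arrays, not the notebook's predictions, to show the layout: rows are true labels and columns are predicted labels, both in the order given by `labels`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = [-1, 0, 1]

# Hypothetical labels, chosen so that class 1 is mostly predicted as 0.
y_true_toy = np.array([-1, -1, 0, 0, 0, 1, 1, 1])
y_pred_toy = np.array([-1,  0, 0, 0, 0, 0, 0, 1])

# cm[i, j] counts examples with true label labels[i] predicted as labels[j].
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=labels)
print(cm)
```

In the notebook, the same call would be `confusion_matrix(y_test, y_pred, labels=[-1,0,1])`.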



In [39]:
model.epochs = 2
model.fit(X_train, y_train, load_at_start=False)
accy3 = model.score(X_test, y_test)

train data size: 720, validation data size: 80


Training  : 100%|██████████| 23/23 [09:24<00:00, 24.55s/it, loss=0.672]
Validating: 100%|██████████| 10/10 [00:17<00:00,  1.78s/it]

Epoch 1, Train loss: 0.6720, Val loss: 0.5317, Val accy: 80.00%



Training  : 100%|██████████| 23/23 [09:23<00:00, 24.51s/it, loss=0.619]
Validating: 100%|██████████| 10/10 [00:16<00:00,  1.70s/it]

Epoch 2, Train loss: 0.6194, Val loss: 0.5671, Val accy: 80.00%



Testing: 100%|██████████| 63/63 [01:40<00:00,  1.60s/it]


Loss: 0.6646, Accuracy: 67.60%





In [40]:
y_pred = model.predict(X_test)
report = classification_report(y_test, y_pred, labels=[-1,0,1])
print(report)

Predicting: 100%|██████████| 63/63 [01:47<00:00,  1.70s/it]

              precision    recall  f1-score   support

          -1       0.89      0.84      0.86       141
           0       0.60      0.97      0.74       227
           1       0.00      0.00      0.00       132

    accuracy                           0.68       500
   macro avg       0.50      0.60      0.53       500
weighted avg       0.52      0.68      0.58       500
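
When a class is never predicted, as with class 1 in this run, its precision is undefined and scikit-learn reports it as 0.00. Assuming scikit-learn 0.22 or later, the `zero_division` argument makes that choice explicit. The sketch below uses hypothetical labels in which class 1 is never predicted.

```python
from sklearn.metrics import classification_report, f1_score

labels = [-1, 0, 1]

# Hypothetical labels: the classifier never outputs class 1.
y_true_toy = [-1, -1, 0, 0, 0, 1, 1]
y_pred_toy = [-1, -1, 0, 0, 0, 0, 0]

# zero_division=0 scores the undefined precision as 0.
report_dict = classification_report(
    y_true_toy, y_pred_toy, labels=labels, zero_division=0, output_dict=True
)
macro_f1 = f1_score(
    y_true_toy, y_pred_toy, labels=labels, average="macro", zero_division=0
)

print(report_dict["1"])
print("macro F1:", round(macro_f1, 4))
```

Macro-averaged F1 weights every class equally, so a class that is never predicted pulls the score down sharply, which plain accuracy hides.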






In [34]:
print(y_pred[100:200])
print(y_test[100:200])

[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1 -1  0  0  0  1  0
 -1  0  0  0  0 -1  0  0  0  0  0  0  0  0  0 -1  0  0  0  0 -1  0  0 -1
  0 -1 -1  0 -1  0  0  0 -1  1 -1  0  0 -1 -1  0  0  0  0  0 -1  0  0  0
 -1  0 -1  0  0  0  1  0  0  0  0 -1  0  0  0 -1  0  0  0  0  0 -1  0  0
  0 -1  0  0]
7672     0
1885     1
3589     1
7844     1
5737    -1
        ..
5248     0
6283     0
10614   -1
8187     1
3719     0
Name: label, Length: 100, dtype: int64
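
Printing the array and the Series separately makes them hard to compare, because `y_pred` is a plain NumPy array while `y_test` keeps the original row index. A small sketch that lines them up in one frame is below. The four rows are the first four entries of each printed slice above, and the names `compare`, `y_test_head`, and `y_pred_head` are illustrative.

```python
import numpy as np
import pandas as pd

# First four entries of the slices printed above.
y_test_head = pd.Series([0, 1, 1, -1], index=[7672, 1885, 3589, 5737], name="label")
y_pred_head = np.array([0, 0, 0, 0])

# Use positional values from the Series so the two columns align row by row.
compare = pd.DataFrame(
    {"true": y_test_head.to_numpy(), "pred": y_pred_head},
    index=y_test_head.index,
)
compare["correct"] = compare["true"] == compare["pred"]

print(compare)
print("accuracy on these rows:", compare["correct"].mean())
```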


In [35]:
print(np.min(y_pred), np.max(y_pred))

-1 1


In [36]:
print(type(y_test))
y_test.describe()

<class 'pandas.core.series.Series'>


count    500.000000
mean      -0.018000
std        0.739439
min       -1.000000
25%       -1.000000
50%        0.000000
75%        1.000000
max        1.000000
Name: label, dtype: float64

In [31]:
%%time
## Test out the model with our own invented examples!

examples = [
    'This Android product is not very good',
    "I could not get that iPhone to work, so I sent it back. I'm really upset!",
    'Another great product from the folks at Google!  We really liked it a lot',
    'My iPad is essential - of course I would buy another one!',
    'When in the course of human events it becomes necessary to dissolve those ties...',
    'We the people, in order to form a more perfect union, establish justice, insure domestic tranquility, ...'
]

print(model.predict_proba(examples))
    

Predicting: 100%|██████████| 1/1 [00:01<00:00,  1.32s/it]

[[0.9803688  0.01074482 0.00888633]
 [0.9852483  0.00825785 0.00649386]
 [0.9806912  0.0106124  0.00869637]
 [0.9810533  0.01045346 0.00849328]
 [0.9809466  0.01066898 0.0083844 ]
 [0.9802644  0.0111946  0.00854098]]
CPU times: user 3.34 s, sys: 284 ms, total: 3.63 s
Wall time: 1.33 s
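
The probability rows are easier to read once mapped back to labels. The sketch below assumes the columns of `predict_proba` follow the order of `label_list` ([-1, 0, 1]), which the output above does not confirm. It uses the first three rows printed above.

```python
import numpy as np

# Assumption: column order matches label_list passed to BertClassifier.
label_list = np.array([-1, 0, 1])

# First three rows of the predict_proba output above.
proba = np.array([
    [0.9803688, 0.01074482, 0.00888633],
    [0.9852483, 0.00825785, 0.00649386],
    [0.9806912, 0.0106124, 0.00869637],
])

# Pick the most probable column per row and look up its label.
pred_labels = label_list[proba.argmax(axis=1)]
confidence = proba.max(axis=1)

print(pred_labels)
print(confidence.round(3))
```

Under that assumption, every example lands in the same class with near-identical probabilities, including the clearly positive third one, which suggests the model at that point was not separating the inputs at all.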





In [52]:
model.save('models/model1_128_bb_uncased.mdl')

### max_seq_length = 256

In [None]:
%%time
## Don't use this one - it will take a very long time!

model = BertClassifier(bert_model='bert-base-uncased', label_list=[-1,0,1])
model.max_seq_length = 256
model.train_batch_size = 32
model.learning_rate = 2e-05
model.epochs = 4

print(model)

model.fit(X_train, y_train)

accy = model.score(X_test, y_test)

### max_seq_length = 512

In [None]:
%%time
## Don't use this one - it will take the longest of all!

model = BertClassifier(bert_model='bert-base-uncased', label_list=[-1,0,1])
model.max_seq_length = 512

# max_seq_length=512 will use a lot more GPU mem, so I am turning down batch size 
# and adding gradient accumulation steps
model.train_batch_size = 16
model.gradient_accumulation_steps = 4

model.learning_rate = 2e-05
model.epochs = 4

print(model)

model.fit(X_train, y_train)

accy = model.score(X_test, y_test)
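
Gradient accumulation sums gradients over several smaller forward/backward passes before each optimizer update, trading time for memory. Whether `train_batch_size` in bert-sklearn denotes the per-pass batch or the effective batch is not confirmed here, so the sketch below simply computes both readings for the settings in this cell.

```python
train_batch_size = 16
gradient_accumulation_steps = 4

# Reading 1 (assumption): train_batch_size is the per-pass batch,
# so each optimizer update sees this many examples in total.
effective_if_per_pass = train_batch_size * gradient_accumulation_steps

# Reading 2 (assumption): train_batch_size is the effective batch,
# and each pass processes only a fraction of it, as some
# fine-tuning scripts do.
per_pass_if_effective = train_batch_size // gradient_accumulation_steps

print(effective_if_per_pass, per_pass_if_effective)
```

Checking the library source for how it combines the two settings would settle which reading applies before relying on either number.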