# HW3 - Compare SVM and BERT for Causal Language Detection

In [1]:
# mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


We will first build a prediction model using an SVM and then a second model using BERT.

## SVM

### Step 1: Read in data

In [2]:
# reading the provided csv file

import pandas as p
import numpy as np
train = p.read_csv("/content/drive/MyDrive/Colab Notebooks/pubmed_causal_language_use.csv")
y = train['label'].values
X = train['sentence'].values

### Step 2: Split train/test data for hold-out test

In [3]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
print(X_train[0])
print(y_train[0])
print(X_test[0])
print(y_test[0])

(2448,) (2448,) (613,) (613,)
The high rate of text message usage makes it feasible to recruit YAMs for a prospective study in which personalized text messages are used to promote healthy behaviours.
0
The lack of symptoms and the preoperative EGD findings were not suggestive of this diagnosis in any case.
0


#### Step 2.1 Data Checking

In [4]:
# Check how many training examples in each category
# this is important to see whether the data set is balanced or skewed

unique, counts = np.unique(y_train, return_counts = True)
print(np.asarray((unique, counts)))

[[   0    1    2    3]
 [1055  409  169  815]]


In [5]:
# Check how many test examples in each category

uniqueTest, countsTest = np.unique(y_test, return_counts = True)
print(np.asarray((uniqueTest, countsTest)))

[[  0   1   2   3]
 [301  85  44 183]]
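To make the balance check concrete, the short sketch below turns the counts printed above into class shares and computes the accuracy of a trivial model that always predicts the most frequent training class. The counts are copied from the two outputs; the helper name `class_shares` is only illustrative.

```python
# Class counts copied from the np.unique outputs above (labels 0, 1, 2, 3).
train_counts = {0: 1055, 1: 409, 2: 169, 3: 815}
test_counts = {0: 301, 1: 85, 2: 44, 3: 183}

def class_shares(counts):
    """Return each class's share of the total number of examples."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

train_shares = class_shares(train_counts)
test_shares = class_shares(test_counts)

for label in train_counts:
    print(f"label {label}: train {train_shares[label]:.1%}, test {test_shares[label]:.1%}")

# Accuracy of a trivial model that always predicts the most frequent training class.
majority_label = max(train_counts, key=train_counts.get)
majority_baseline = test_counts[majority_label] / sum(test_counts.values())
print(f"majority-class baseline accuracy on the test split: {majority_baseline:.3f}")
```

The data set is skewed: label 2 (conditional causal) is by far the smallest class, and any classifier should be judged against the roughly 49% accuracy that the majority-class baseline already reaches.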


### Step 3: Vectorization

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# several commonly used vectorizer settings

#  unigram boolean vectorizer, set minimum document frequency to 5
unigram_bool_vectorizer = CountVectorizer(encoding = 'latin-1', binary = True, min_df = 5, stop_words = 'english')

#  unigram term frequency vectorizer, set minimum document frequency to 5
unigram_count_vectorizer = CountVectorizer(encoding = 'latin-1', binary = False, min_df = 5, stop_words = 'english')

#  unigram and bigram term frequency vectorizer, set minimum document frequency to 5
gram12_count_vectorizer = CountVectorizer(encoding = 'latin-1', ngram_range = (1,2), min_df = 5, stop_words = 'english')

#  unigram tfidf vectorizer, set minimum document frequency to 5
unigram_tfidf_vectorizer = TfidfVectorizer(encoding = 'latin-1', use_idf = True, min_df = 5, stop_words = 'english')


#### Step 3.1: Vectorize the training data

We will use the tf-idf vectorizer because it weights each term both by how often it occurs in a sentence and by how rare it is across the corpus, so frequent but uninformative words contribute less. This weighting may improve classification accuracy by helping the SVM find better separating hyperplanes between the labels.

References -
1. https://ieeexplore.ieee.org/document/9274650
2. https://www.linkedin.com/pulse/count-vectorizers-vs-tfidf-natural-language-processing-sheel-saket#:~:text=TF%2DIDF%20is%20better%20than,by%20reducing%20the%20input%20dimensions.
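The weighting idea can be sketched in plain Python on a toy corpus. This is only an illustration, assuming the smoothed idf, `ln((1 + n) / (1 + df)) + 1`, and the L2 normalisation that scikit-learn documents as the `TfidfVectorizer` defaults; the three sentences are made up and do not come from the data set.

```python
import math
from collections import Counter

# Toy corpus (hypothetical sentences, not taken from the dataset).
docs = [
    "the study improved outcomes",
    "the study needed more research",
    "the treatment improved survival",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """L2-normalised tf-idf weights with smoothed idf: ln((1 + n) / (1 + df)) + 1."""
    counts = Counter(tokens)
    raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf(tokenized[0])
for term, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{term:10s} {w:.3f}")
```

In the first sentence every word occurs once, yet "the" (present in all three documents) receives the smallest weight and "outcomes" (present in only one) the largest, which is the behaviour the paragraph above relies on.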

In [7]:
# fit vocabulary in training documents and transform the training documents into vectors
X_train_vec = unigram_tfidf_vectorizer.fit_transform(X_train)

# check the content of a document vector
print(X_train_vec.shape)
print(X_train_vec[0].toarray())

# check the size of the constructed vocabulary
print(len(unigram_tfidf_vectorizer.vocabulary_))

# print out the first 10 items in the vocabulary
print(list(unigram_tfidf_vectorizer.vocabulary_.items())[:10])

# check word index in vocabulary
print(unigram_tfidf_vectorizer.vocabulary_.get('physical'))

(2448, 1195)
[[0. 0. 0. ... 0. 0. 0.]]
1195
[('high', 480), ('rate', 889), ('feasible', 402), ('prospective', 867), ('study', 1056), ('used', 1155), ('promote', 863), ('healthy', 472), ('reductions', 910), ('ldl', 591)]
786


#### Step 3.2: Vectorize the test data

In [8]:
X_test_vec = unigram_tfidf_vectorizer.transform(X_test)

# print out #examples and #features in the test set
print(X_test_vec.shape)

(613, 1195)


### Step 4: Train a LinearSVC classifier

In [9]:
# import the LinearSVC module
from sklearn.svm import LinearSVC

# initialize the LinearSVC model
svm_clf = LinearSVC(C = 1)

# use the training data to train the model
svm_clf.fit(X_train_vec, y_train)

LinearSVC(C=1)

#### Step 4.1 Interpret a trained LinearSVC model

In [10]:
## LinearSVC builds four binary classifiers (one-vs-rest): "0 vs. others", "1 vs. others", "2 vs. others", "3 vs. others"
## and then picks the most confident prediction as the final prediction.

## Each binary classifier assigns a weight to every feature; sorting the weights ranks the features by how strongly they indicate that category
## For category "No relationship", get all features and their weights and sort them in increasing order
feature_ranks_0 = sorted(zip(svm_clf.coef_[0], unigram_tfidf_vectorizer.get_feature_names_out()))

## get the 10 features that are best indicators of 0 label (they are at the bottom of the ranked list)
label_0_10 = feature_ranks_0[-10:]
print("No relationship words")
for i in range(0, len(label_0_10)):
    print(label_0_10[i])
print()

## For category "Direct causal", get all features and their weights and sort them in increasing order
feature_ranks_1 = sorted(zip(svm_clf.coef_[1], unigram_tfidf_vectorizer.get_feature_names_out()))

## get the 10 features that are best indicators of 1 label (they are at the bottom of the ranked list)
label_1_10 = feature_ranks_1[-10:]
print("Direct causal words")
for i in range(0, len(label_1_10)):
    print(label_1_10[i])
print()

## For category "Conditional causal", get all features and their weights and sort them in increasing order
feature_ranks_2 = sorted(zip(svm_clf.coef_[2], unigram_tfidf_vectorizer.get_feature_names_out()))

## get the 10 features that are best indicators of 2 label (they are at the bottom of the ranked list)
label_2_10 = feature_ranks_2[-10:]
print("Conditional causal words")
for i in range(0, len(label_2_10)):
    print(label_2_10[i])
print()

## For category "Correlational", get all features and their weights and sort them in increasing order
feature_ranks_3 = sorted(zip(svm_clf.coef_[3], unigram_tfidf_vectorizer.get_feature_names_out()))

## get the 10 features that are best indicators of 3 label (they are at the bottom of the ranked list)
label_3_10 = feature_ranks_3[-10:]
print("Correlational words")
for i in range(0, len(label_3_10)):
    print(label_3_10[i])
print()


No relationship words
(1.4084141619006691, 'focus')
(1.4285348262007627, 'assessment')
(1.437899423026315, 'performed')
(1.4500503830646205, 'trials')
(1.578807001412499, 'assess')
(1.6878044924579994, 'required')
(1.6881981040140208, 'implications')
(1.7434850186481985, 'research')
(2.139177837391045, 'studies')
(2.9535661114934744, 'needed')

Direct causal words
(1.5989563340448967, 'reduced')
(1.6949145452723313, 'benefits')
(1.835859503643258, 'oral')
(1.8431860264058906, 'reduces')
(2.079409562609014, 'beneficial')
(2.1452524847693106, 'improves')
(2.296892749074168, 'did')
(2.351600428452984, 'improved')
(2.4372985451536002, 'effective')
(3.1057975144919747, 'resulted')

Conditional causal words
(1.3903558534684948, 'increase')
(1.4051896362780498, 'role')
(1.4258534581022673, 'decrease')
(1.4321473394467839, 'responsible')
(1.4345276902495223, 'appeared')
(1.4386656159026316, 'result')
(1.4461063609129279, 'appear')
(1.7400567109554084, 'play')
(1.779270577898176, 'reduce')
(1.9

### Step 5: Test the LinearSVC classifier

In [11]:
# test the classifier on the test data set, print accuracy score

svm_clf.score(X_test_vec, y_test)

0.7357259380097879

In [12]:
# print confusion matrix and classification report

from sklearn.metrics import confusion_matrix
y_pred = svm_clf.predict(X_test_vec)
cm = confusion_matrix(y_test, y_pred, labels = [0,1,2,3])
print(cm)
print()

from sklearn.metrics import classification_report
target_names = ['0', '1', '2', '3']
print(classification_report(y_test, y_pred, target_names = target_names))

[[249  21   5  26]
 [ 21  47   4  13]
 [ 11  13  12   8]
 [ 26  11   3 143]]

              precision    recall  f1-score   support

           0       0.81      0.83      0.82       301
           1       0.51      0.55      0.53        85
           2       0.50      0.27      0.35        44
           3       0.75      0.78      0.77       183

    accuracy                           0.74       613
   macro avg       0.64      0.61      0.62       613
weighted avg       0.73      0.74      0.73       613
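
The numbers in the report can be recomputed by hand from the confusion matrix, which helps when reading it. The sketch below copies the matrix printed above and derives per-class precision, recall and F1, plus accuracy and macro F1; the function name `per_class_metrics` is just for illustration.

```python
# Confusion matrix copied from the LinearSVC output above.
# Rows are true labels 0-3, columns are predicted labels 0-3.
cm = [
    [249, 21, 5, 26],
    [21, 47, 4, 13],
    [11, 13, 12, 8],
    [26, 11, 3, 143],
]

def per_class_metrics(matrix):
    """Return a list of (precision, recall, f1) tuples, one per class."""
    n = len(matrix)
    metrics = []
    for k in range(n):
        tp = matrix[k][k]
        predicted_k = sum(matrix[i][k] for i in range(n))  # column sum
        actual_k = sum(matrix[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics.append((precision, recall, f1))
    return metrics

metrics = per_class_metrics(cm)
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))
macro_f1 = sum(f1 for _, _, f1 in metrics) / len(metrics)

for label, (prec, rec, f1) in enumerate(metrics):
    print(f"label {label}: precision {prec:.2f}, recall {rec:.2f}, f1 {f1:.2f}")
print(f"accuracy {accuracy:.4f}, macro f1 {macro_f1:.2f}")
```

For label 2, only 12 of the 44 true examples sit on the diagonal, which is why its recall (0.27) is so much lower than that of the other classes.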



#### Step 5.1 Interpret the prediction result

In [13]:
## get the confidence scores for all test examples from each of the four binary classifiers
svm_confidence_scores = svm_clf.decision_function(X_test_vec)
## get the confidence scores for the first test example
print(svm_confidence_scores[0])

## Confirm by printing out the predicted label, the true label, and the sentence of the first test example
print(y_pred[0])
print(y_test[0])
print(X_test[0])

[ 0.70898614 -1.24137025 -1.49552454 -0.50387148]
0
0
The lack of symptoms and the preoperative EGD findings were not suggestive of this diagnosis in any case.


Because the confidence score is highest for category 0, the prediction should be 0, which matches the label printed above.
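
The selection rule can be written out in a few lines. This sketch copies the four scores printed above and picks the label with the largest one; the `margin` value is an extra illustration of how far ahead the winning classifier is, not something the notebook computes.

```python
# Decision scores for the first test example, copied from the output above.
scores = [0.70898614, -1.24137025, -1.49552454, -0.50387148]
labels = [0, 1, 2, 3]  # order of the one-vs-rest classifiers

# The predicted label is the one whose binary classifier gives the largest score.
best_index = max(range(len(scores)), key=lambda i: scores[i])
predicted_label = labels[best_index]
print(predicted_label)

# Gap to the runner-up: a rough indication of how clear-cut the decision was.
ranked = sorted(scores, reverse=True)
margin = ranked[0] - ranked[1]
print(round(margin, 3))
```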

In [14]:
# output prediction probs

from sklearn.calibration import CalibratedClassifierCV
svm_calibrated = CalibratedClassifierCV(svm_clf) 
svm_calibrated.fit(X_train_vec, y_train)
y_test_proba = svm_calibrated.predict_proba(X_test_vec)
y_test_proba[0]

array([0.69978076, 0.07508116, 0.02046983, 0.20466825])

#### Step 5.2 Error Analysis

In [15]:
# print out specific type of error for further analysis

# print out the correlational examples that are mistakenly predicted as no relationship
# according to the confusion matrix, there should be 26 such examples

print("correlational examples that are mistakenly predicted as no relationship -")
print()
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i] == 3 and y_pred[i] == 0):
        print(X_test[i])
        err_cnt = err_cnt + 1
print("errors:", err_cnt)
print()

# print out the no relationship examples that are mistakenly predicted as correlational
# according to the confusion matrix, there should be 26 such examples

print("no relationship examples that are mistakenly predicted as correlational -")
print()
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i] == 0 and y_pred[i] == 3):
        print(X_test[i])
        err_cnt = err_cnt + 1
print("errors:", err_cnt)

correlational examples that are mistakenly predicted as no relationship -

Adopting a home-like model of residential care may be associated with reduced prevalence and costs of PIMs.
In ACS patients, without previous history of DM, MS is highly prevalent.
Reproductive factors were to some extent associated with all subtypes; the strongest trends were with luminal-like subtypes.
Distractions are prevalent in ORs and in this study were linked to deterioration in intraoperative patient safety checks.
Extrapulmonary manifestations may be useful clues for diagnosis.
Oral impacts were more frequently reported in T2D cases than controls.
More than three-fourths of the patients in this study had a history of smoking, which is considered a triggering or aggravating factor for PPP.
Chemerin showed positive correlations with potent health threatening components of lipid profile including triglyceride and cholesterol levels in adolescents.
Data from NIV can identify a change in breathing patterns 

In [16]:
# print out specific type of error for further analysis

# print out the conditional causal examples that are mistakenly predicted as direct causal
# according to the confusion matrix, there should be 13 such examples

print("conditional causal examples that are mistakenly predicted as direct causal -")
print()
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i] == 2 and y_pred[i] == 1):
        print(X_test[i])
        err_cnt = err_cnt + 1
print("errors:", err_cnt)
print()

# print out the direct causal examples that are mistakenly predicted as conditional causal
# according to the confusion matrix, there should be 4 such examples

print("direct causal examples that are mistakenly predicted as conditional causal -")
print()
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i] == 1 and y_pred[i] == 2):
        print(X_test[i])
        err_cnt = err_cnt + 1
print("errors:", err_cnt)

conditional causal examples that are mistakenly predicted as direct causal -

Multiple and immediate access to the web-based education program at home may prove useful as a source of reference for women with GDM.
We conclude that PTG tends to lead to less psychological distress overall but particularly so in a high impact context.(PsycINFO Database Record
Hence, this community-based overweight and obesity treatment program may help accommodate the need for improvements in treatment availability and accessibility.
Thus, the gut may mediate vitamin D effects.
Incorporating culturally centered individualized MT may be an effective intervention to reduce stressors.
"However, the results confirm that the vaccine is safe and suggest that vaccination may have clinical benefit in patients with low HER2-expressing tumors, specifically TNBC."
Although not significant, IR levels decreased over time, which could be indicative of a clinical improvement.
The survival benefit suggests that diabetes s

### Step 6. Cross Validation

In [17]:
# Cross validation
# for more details see https://scikit-learn.org/stable/modules/cross_validation.html
# evaluation metrics provided by sklearn - https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

# if you need to output only one metric, such as accuracy or f1_macro, use the "cross_val_score" function

from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

svm_clf_pipe = Pipeline([('vect', TfidfVectorizer(encoding = 'latin-1', use_idf = True)),('svm', LinearSVC(C = 1))])
scores = cross_val_score(svm_clf_pipe, X, y, cv = 5, scoring = 'f1_macro')

In [18]:
# if you need to output multiple metrics, such as accuracy and f1_macro, use the "cross_validate" function
from sklearn.model_selection import cross_validate

scoring = ['accuracy', 'f1_macro', 'f1_micro', 'precision_macro', 'recall_macro']
svm_clf_pipe = Pipeline([('vect', TfidfVectorizer(encoding = 'latin-1', use_idf = True)),('svm', LinearSVC(C=1))])
scores = cross_validate(svm_clf_pipe, X, y, cv = 5, scoring = scoring, return_train_score = True)

In [19]:
sorted(scores.keys())

['fit_time',
 'score_time',
 'test_accuracy',
 'test_f1_macro',
 'test_f1_micro',
 'test_precision_macro',
 'test_recall_macro',
 'train_accuracy',
 'train_f1_macro',
 'train_f1_micro',
 'train_precision_macro',
 'train_recall_macro']

In [20]:
# retrieve scores from a metric
def get_metric_scores(scores, metric, train_or_test, verbose=False):
    """Print the mean of one cross-validation metric, plus the per-fold values if verbose."""
    metric_name = train_or_test + '_' + metric
    print(metric_name)

    metric_scores = scores[metric_name]
    if verbose:
        print(metric_scores)
    avg = sum(metric_scores) / len(metric_scores)
    print('average')
    print("{:.3f}".format(avg))

In [21]:
#retrieve test accuracy scores
get_metric_scores(scores, 'accuracy', 'test', verbose=True)

test_accuracy
[0.74877651 0.7369281  0.75653595 0.73529412 0.72875817]
average
0.741
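
An average alone hides how much the folds disagree. As a small sketch, the per-fold accuracies printed above can be summarised as mean plus or minus the sample standard deviation, using only the standard library.

```python
from statistics import mean, stdev

# Per-fold test accuracies copied from the cross_validate output above.
fold_accuracy = [0.74877651, 0.7369281, 0.75653595, 0.73529412, 0.72875817]

avg = mean(fold_accuracy)
spread = stdev(fold_accuracy)  # sample standard deviation across the 5 folds

print(f"test accuracy: {avg:.3f} +/- {spread:.3f}")
```

A spread of about one percentage point is worth keeping in mind when two settings are compared: differences smaller than that may simply reflect which sentences landed in which fold.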


In [22]:
#retrieve training accuracy scores
get_metric_scores(scores, 'accuracy', 'train')

train_accuracy
average
0.999


In [23]:
# compare performance with different choice of C values

svm_clf_pipe2 = Pipeline([('vect', TfidfVectorizer(encoding = 'latin-1', use_idf = True)),('svm', LinearSVC(C = 0.5))])
scores2 = cross_validate(svm_clf_pipe2, X, y, cv = 5, scoring = scoring, return_train_score = True)

svm_clf_pipe3 = Pipeline([('vect', TfidfVectorizer(encoding = 'latin-1', use_idf = True)),('svm', LinearSVC(C = 2))])
scores3 = cross_validate(svm_clf_pipe3, X, y, cv = 5, scoring = scoring, return_train_score = True)

In [24]:
# compare the effect of different C values
# C=1
print('C=1\n')
get_metric_scores(scores, 'accuracy', 'train')
get_metric_scores(scores, 'accuracy', 'test')

# C=0.5
print('\nC=0.5\n')
get_metric_scores(scores2, 'accuracy', 'train')
get_metric_scores(scores2, 'accuracy', 'test')

# C=2
print('\nC=2\n')
get_metric_scores(scores3, 'accuracy', 'train')
get_metric_scores(scores3, 'accuracy', 'test')



C=1

train_accuracy
average
0.999
test_accuracy
average
0.741

C=0.5

train_accuracy
average
0.991
test_accuracy
average
0.744

C=2

train_accuracy
average
1.000
test_accuracy
average
0.735
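
To read the three runs side by side, the sketch below copies the averages printed above into a dictionary and reports the gap between training and test accuracy for each value of C. The numbers are the notebook's own; only the summary logic is new.

```python
# Mean accuracies copied from the output above: C -> (train, test).
results = {
    1: (0.999, 0.741),
    0.5: (0.991, 0.744),
    2: (1.000, 0.735),
}

for c_value in sorted(results):
    train_acc, test_acc = results[c_value]
    gap = train_acc - test_acc
    print(f"C={c_value}: train {train_acc:.3f}, test {test_acc:.3f}, gap {gap:.3f}")

# C with the highest mean cross-validated test accuracy among the three tried.
best_c = max(results, key=lambda c: results[c][1])
print("best C by test accuracy:", best_c)
```

All three settings fit the training folds almost perfectly while test accuracy stays near 0.74, which suggests overfitting. C = 0.5 comes out slightly ahead, but the three test averages differ by less than 0.01, a smaller difference than the variation between individual folds reported earlier (0.729 to 0.757).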


***



## BERT

In [25]:
# install BERT sklearn wrapper written by charles9n

!git clone -b master https://github.com/charles9n/bert-sklearn
!cd bert-sklearn; pip install .

Cloning into 'bert-sklearn'...
remote: Enumerating objects: 259, done.
remote: Total 259 (delta 0), reused 0 (delta 0), pack-reused 259
Receiving objects: 100% (259/259), 516.15 KiB | 21.51 MiB/s, done.
Resolving deltas: 100% (131/131), done.
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Processing /content/bert-sklearn
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
Collecting boto3
  Downloading boto3-1.24.96-py3-none-any.whl (132 kB)
     |████████████████████████████████| 132 kB 29.4 MB/s 
Collecting s3transfer<0.7.0,>=0.6.0
  Downloading s3transfer-0.6.0-py3-none-any.wh

In [26]:
df = p.read_csv("/content/drive/MyDrive/Colab Notebooks/pubmed_causal_language_use.csv")

In [27]:
# prepare train and test data
# the data set has only 3061 rows, so dff[:5000] keeps every shuffled row for training
# and dff[-1000:] takes the last 1000 of those same rows for testing.
# NOTE: the test rows are therefore also training rows, so the BERT test scores below
# are optimistic and not directly comparable to the SVM hold-out results.

dff = df.sample(frac = 1)
df_train_bert = dff[:5000]
df_test = dff[-1000:]
print(df_train_bert.shape)
print(df_test.shape)

X_train_bert, y_train_bert = df_train_bert['sentence'].values, df_train_bert['label'].values

X_test, y_test = df_test['sentence'].values, df_test['label'].values

(3061, 2)
(1000, 2)
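
The shapes above show that the training frame contains all 3061 rows: `dff[:5000]` on a 3061-row frame returns every row, and `dff[-1000:]` returns the last 1000 rows of the same shuffled frame, so the two selections overlap. A minimal sketch of a split without overlap follows. The helper `disjoint_split`, the fixed seed, and the 613-row test size (chosen to mirror the SVM hold-out) are assumptions for illustration, not what the notebook ran.

```python
import random

def disjoint_split(n_rows, n_test, seed=0):
    """Shuffle row positions and return (train_positions, test_positions) with no overlap."""
    if not 0 < n_test < n_rows:
        raise ValueError("n_test must be between 1 and n_rows - 1")
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)
    return positions[n_test:], positions[:n_test]

# 3061 is the row count printed above; the 613-row test size mirrors the SVM hold-out.
train_pos, test_pos = disjoint_split(n_rows=3061, n_test=613)
print(len(train_pos), len(test_pos))
print(set(train_pos).isdisjoint(test_pos))
```

With pandas, the positions could then be used as `df.iloc[train_pos]` and `df.iloc[test_pos]`. Scores measured on such a split would be comparable to the SVM hold-out results, unlike scores measured on rows the model has already seen during fine-tuning.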


In [28]:
# fine-tune a BERT base uncased model
# the wrapper tokenizes the raw sentences itself, so no separate vectorization step is needed as with LinearSVC
# first the pre-trained BERT model is loaded
# then training starts: 90% of the examples are used for training and the other 10% for validation (parameter tuning)
# the default setting is 3 epochs; each epoch is one full pass over the training examples
from bert_sklearn import BertClassifier
model = BertClassifier()         # text/text pair classification
print(model)
model.fit(X_train_bert, y_train_bert)

Building sklearn text classifier...
BertClassifier()


100%|██████████| 231508/231508 [00:00<00:00, 254771.36B/s]


Loading bert-base-uncased model...


100%|██████████| 440473133/440473133 [00:36<00:00, 12004098.11B/s]
100%|██████████| 433/433 [00:00<00:00, 120497.19B/s]


Defaulting to linear classifier/regressor
Loading Pytorch checkpoint

train data size: 2755, validation data size: 306





Training  :   0%|          | 0/87 [00:00<?, ?it/s]



Validating:   0%|          | 0/39 [00:00<?, ?it/s]


Epoch 1, Train loss: 0.8107, Val loss: 0.4235, Val accy: 85.62%



Training  :   0%|          | 0/87 [00:00<?, ?it/s]

Validating:   0%|          | 0/39 [00:00<?, ?it/s]


Epoch 2, Train loss: 0.2608, Val loss: 0.3429, Val accy: 88.89%



Training  :   0%|          | 0/87 [00:00<?, ?it/s]

Validating:   0%|          | 0/39 [00:00<?, ?it/s]


Epoch 3, Train loss: 0.1028, Val loss: 0.3295, Val accy: 89.22%



BertClassifier(do_lower_case=True, label_list=array([0, 1, 2, 3]))
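
The sizes and progress-bar totals in the log can be reproduced with a little arithmetic. In the sketch below the 10% validation share comes from the comments in the cell, while the batch sizes of 32 (training) and 8 (validation) are assumptions inferred from the totals of 87 and 39 steps; they were not read from the wrapper's configuration.

```python
import math

n_rows = 3061          # rows passed to model.fit (row count of the data set)
val_fraction = 0.1     # 10% held out for validation, as described in the cell comments

n_val = int(n_rows * val_fraction)
n_train = n_rows - n_val
print(n_train, n_val)

# Assumed batch sizes (inferred from the progress-bar totals, not read from the wrapper).
train_batch_size = 32
eval_batch_size = 8

train_steps = math.ceil(n_train / train_batch_size)
val_steps = math.ceil(n_val / eval_batch_size)
print(train_steps, val_steps)
```

Under these assumptions the numbers agree with the log: 2755 training and 306 validation examples, processed in 87 and 39 batches per epoch.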

In [29]:
model.save('bert-sentiment.model')

In [30]:
# test accuracy
bert_acc = model.score(X_test, y_test)
print(bert_acc)

from sklearn.metrics import f1_score

# test f1_score
y_pred = model.predict(X_test)
bert_f1 = f1_score(y_test, y_pred, average = None)
bert_f1_macro = f1_score(y_test, y_pred, average = 'macro')

print(bert_f1)
print(bert_f1_macro)

Testing:   0%|          | 0/125 [00:00<?, ?it/s]


Loss: 0.0922, Accuracy: 97.90%
97.89999999999999


Predicting:   0%|          | 0/125 [00:00<?, ?it/s]

[0.9810901  0.96716418 0.95652174 0.98617512]
0.97273778338838


In [31]:
# print BERT model's confusion matrix
cm = confusion_matrix(y_test, y_pred, labels = [0,1,2,3])
print(cm)
# print()

[[441   4   2   4]
 [  4 162   1   1]
 [  0   1  55   1]
 [  3   0   0 321]]


In [32]:
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

# print precision, recall macro avg
bert_precision = precision_score(y_test, y_pred, average = 'macro')
bert_recall = recall_score(y_test, y_pred, average = 'macro')

print(bert_precision)
print(bert_recall)

0.9710905296138189
0.9744414466814979


In [33]:
# get_metric_scores(scores, 'accuracy', 'test', verbose=True)

In [34]:
# BERT error analysis
print("no relationship examples that are mistakenly predicted as correlational -")
print()
err_cnt = 0
for i in range(0, len(y_test)):
  if (y_test[i] == 0 and y_pred[i] == 3):
    print(X_test[i])
    err_cnt = err_cnt + 1
print("errors:", err_cnt)
print()
print("correlational examples that are mistakenly predicted as no relationship -")
print()
err_cnt = 0
for i in range(0, len(y_test)):
  if (y_test[i] == 3 and y_pred[i] == 0):
    print(X_test[i])
    err_cnt = err_cnt + 1
print("errors:", err_cnt)

no relationship examples that are mistakenly predicted as correlational -

SLNB prior to NAT is an accurate method of axillary staging associated with a high identification rate.
Choroidal vascular density and volume are significantly reduced in more advanced stages of diabetic retinopathy.
For geriatrics, oncology, gastroenterology, and internal medicine, this ratio was even greater (1 out of 3–4).
Weight gain and growth equaled that of healthy peers at 9 to 12 years but was suboptimal at 5 to 8 years and dramatically declines at 13 to 16 years.
errors: 4

correlational examples that are mistakenly predicted as no relationship -

The majority of fetuses are not acidemic even when the FHR trace is pathological and stimulation tests are only helpful when accelerations are provoked.
RNFL and GC-IPL complex thickness remained unaffected following LASIK intervention.
Patients with new onset rheumatoid arthritis at our institution received similar disease modifying anti-rheumatic drug tre