# Triple validation with SVM and Naive Bayes

### Introduction

This notebook continues the effort to extract RDF triples from raw text and focuses on the next step in the process: validation of the extracted triples.

#### Triples

#### Naive Bayes

#### Support Vector Machines

### Triple Validation

#### Data

The data consists of a CSV file containing the output of a method that extracts triples from raw text. The labels were added to the data manually.

There is no true test set, but there is a small set for human evaluation containing 10 pairs of negative/positive triples. These pairs were made by removing negative triples from the initial dataset and editing them to become positive. The triples were removed to ensure that the model has not been trained or validated on this data.

#### Training classifiers


In [25]:
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import model_selection, naive_bayes, svm
from sklearn.metrics import accuracy_score, f1_score, classification_report #precision_score, recall_score

In [47]:
triple_data = pd.read_csv('data/train_triples.csv', names=['triple', 'label'])

Let's look at the data.

In [48]:
triple_data

| | triple | label |
|---|---|---|
| 0 | luck can plot | 0 |
| 1 | use is period | 0 |
| 2 | sword is swordsmanship | 0 |
| 3 | system are size | 0 |
| 4 | spear is thrower | 0 |
| ... | ... | ... |
| 2058 | light is interaction | 0 |
| 2059 | size be mass | 0 |
| 2060 | month has group | 0 |
| 2061 | problem be number | 0 |


Label 0 marks a negative example and label 1 marks a positive example.

The extraction method has a high yield, but produces few positive examples.

In [49]:
triple_data['label'].value_counts()

0    1910
1     153
Name: label, dtype: int64

The data does not come with a train/test split, so the splitter from scikit-learn is used. Because several models are compared on the same held-out data, that part of the split serves as a validation set rather than a test set.

The split is 80/20 to maximize the number of positive examples.

In [5]:
train_triples, validate_triples, train_labels, validate_labels = model_selection.train_test_split(triple_data['triple'], triple_data['label'], test_size=0.20)

Let's make sure that the validation labels contain positive class labels.

In [6]:
validate_labels.value_counts()

0    379
1     34
Name: label, dtype: int64
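Because the positive class is rare, a plain random split can give the validation set a different share of positives on each run. A minimal sketch of a stratified, reproducible split follows. It uses a synthetic frame with the same label counts as above (1910 negative, 153 positive); the placeholder triple strings and `random_state=42` are assumptions for illustration, not values from this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for triple_data with the same class counts as the real data.
labels = [0] * 1910 + [1] * 153
toy_data = pd.DataFrame({
    "triple": [f"placeholder triple {i}" for i in range(len(labels))],
    "label": labels,
})

# stratify keeps the positive/negative ratio (nearly) equal in both parts;
# random_state makes the split repeatable between runs.
train_x, val_x, train_y, val_y = train_test_split(
    toy_data["triple"],
    toy_data["label"],
    test_size=0.20,
    stratify=toy_data["label"],
    random_state=42,
)

print(val_y.value_counts())
```

With stratification the validation part should hold roughly the same 7% share of positives as the full data, regardless of the seed.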

The triples column needs to be converted into vectors. Scikit-learn offers a count vectorizer (`CountVectorizer`) and a TF-IDF vectorizer (`TfidfVectorizer`). Let's test both and see which one works better for the triples.

In [7]:
tfidf_vect = TfidfVectorizer()
tfidf_vect.fit(triple_data['triple'])
train_triple_tfidf = tfidf_vect.transform(train_triples)
validate_triple_tfidf = tfidf_vect.transform(validate_triples)

In [8]:
count_vect = CountVectorizer()
count_vect.fit(triple_data['triple'])
train_triple_count = count_vect.transform(train_triples)
validate_triple_count = count_vect.transform(validate_triples)
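Both vectorizers above are fitted on the full `triple_data['triple']` column, so the vocabulary and IDF weights also see the validation triples. A possible alternative, sketched here under the assumption that the vectorizer should learn from the training part only, is to wrap the vectorizer and the classifier in a scikit-learn `Pipeline`. The toy triples and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical miniature data, only to make the sketch runnable.
toy_train = ["horse has age", "age had horse", "maple have leaf", "luck can plot"]
toy_train_labels = [1, 0, 1, 0]
toy_validate = ["mountain has summit"]

# fit() runs the vectorizer on the training triples only, then trains the classifier.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])
model.fit(toy_train, toy_train_labels)

# Words seen only in validation data ("mountain", "summit") are not in the vocabulary.
vocabulary = model.named_steps["tfidf"].vocabulary_
print(sorted(vocabulary))
print(model.predict(toy_validate))
```

The same pipeline object can be passed to cross-validation helpers, which refit the vectorizer inside every fold.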

In [9]:
print(train_triple_tfidf[:20])

  (0, 1452)	0.5735386140944267
  (0, 380)	0.6255158062866324
  (0, 119)	0.5289455872093353
  (1, 1344)	0.5985091829588916
  (1, 205)	0.8011159453623927
  (2, 372)	0.6444235080799443
  (2, 117)	0.3685205735582747
  (2, 19)	0.6700081559938044
  (3, 1389)	0.598176134792705
  (3, 947)	0.6833627921239789
  (3, 611)	0.41856971474890436
  (4, 568)	0.716404371262967
  (4, 186)	0.6359747460045079
  (4, 68)	0.2868813330975282
  (5, 1004)	0.6747731595054057
  (5, 710)	0.2489062186913345
  (5, 145)	0.6947854902830612
  (6, 1501)	0.3353780766423881
  (6, 1389)	0.6445582049628327
  (6, 625)	0.687070786835493
  (7, 1321)	0.6191787340024916
  (7, 963)	0.6819187539060186
  (7, 611)	0.3893642361983631
  (8, 1501)	0.29870100432599034
  (8, 614)	0.6009335804991323
  :	:
  (11, 1306)	0.953957094393142
  (11, 611)	0.2999430980319334
  (12, 860)	0.734768133234705
  (12, 618)	0.3865557138688791
  (12, 438)	0.5573961521737552
  (13, 653)	0.6437970608463044
  (13, 117)	0.37986010889667665
  (13, 37)	0.664252694

In [10]:
print(train_triple_count[:20])

  (0, 119)	1
  (0, 380)	1
  (0, 1452)	1
  (1, 205)	1
  (1, 1344)	1
  (2, 19)	1
  (2, 117)	1
  (2, 372)	1
  (3, 611)	1
  (3, 947)	1
  (3, 1389)	1
  (4, 68)	1
  (4, 186)	1
  (4, 568)	1
  (5, 145)	1
  (5, 710)	1
  (5, 1004)	1
  (6, 625)	1
  (6, 1389)	1
  (6, 1501)	1
  (7, 611)	1
  (7, 963)	1
  (7, 1321)	1
  (8, 581)	1
  (8, 614)	1
  :	:
  (11, 611)	1
  (11, 1306)	2
  (12, 438)	1
  (12, 618)	1
  (12, 860)	1
  (13, 37)	1
  (13, 117)	1
  (13, 653)	1
  (14, 710)	1
  (14, 915)	1
  (14, 1294)	1
  (15, 119)	1
  (15, 1294)	1
  (15, 1460)	1
  (16, 217)	1
  (16, 862)	1
  (16, 1501)	1
  (17, 465)	1
  (17, 826)	1
  (17, 1515)	1
  (18, 710)	1
  (18, 1004)	1
  (19, 553)	1
  (19, 611)	1
  (19, 800)	1


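The count vectorizer returns raw term counts, which are almost always 1 because a word rarely repeats within a triple, while TF-IDF returns a distinct weight for each term. The TF-IDF vectors are used for training.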
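The difference between the two representations can be seen on a handful of triples. The sketch below uses made-up toy triples (an assumption, not rows from the dataset): `CountVectorizer` stores integer term counts, while `TfidfVectorizer` with default settings stores IDF-weighted values and scales every row to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy triples; the last one repeats a word, like row 11 in the output above.
toy_triples = ["horse has age", "maple have leaf", "size be size"]

counts = CountVectorizer().fit_transform(toy_triples)
tfidf = TfidfVectorizer().fit_transform(toy_triples)

# Counts are integers: 1 for most words, 2 for the repeated "size".
print(counts.toarray())

# TF-IDF values are floats; with the default norm="l2" each row has length 1.
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)
print(np.round(tfidf.toarray(), 3))
print(row_norms)
```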

With the triples as vectors, it is time to train some models. Four classifiers will be fitted on the training set: Naive Bayes and SVM with three types of kernel (linear, RBF and polynomial).

In [11]:
naive = naive_bayes.MultinomialNB()
naive.fit(train_triple_tfidf, train_labels)
predictions_NB = naive.predict(validate_triple_tfidf)

In [12]:

# degree and gamma are ignored by the linear kernel
svm_lin = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
svm_lin.fit(train_triple_tfidf, train_labels)
predictions_SVM = svm_lin.predict(validate_triple_tfidf)

In [13]:
rbf_svm = svm.SVC(kernel='rbf', gamma=0.5, C=1.0)
rbf_svm.fit(train_triple_tfidf, train_labels)
rbf_pred = rbf_svm.predict(validate_triple_tfidf)

In [14]:
poly_svm = svm.SVC(kernel='poly', degree=3, C=1)
poly_svm.fit(train_triple_tfidf, train_labels)
poly_pred = poly_svm.predict(validate_triple_tfidf)

#### Classifier scoring

Accuracy and F1 are among the most commonly used metrics in supervised machine learning.

In [15]:
print("Naive Bayes Accuracy Score -> ", accuracy_score(validate_labels, predictions_NB )*100)
print("Naive Bayes F1 Score -> ", f1_score(validate_labels, predictions_NB, average='weighted')*100)

print("SVM Linear Accuracy Score -> ", accuracy_score(validate_labels, predictions_SVM)*100)
print("SVM Linear F1 Score -> ", f1_score(validate_labels, predictions_SVM, average='weighted')*100)

print("SVM-RBF Accuracy Score -> ", accuracy_score(validate_labels, rbf_pred)*100)
print("SVM-RBF F1 Score -> ", f1_score(validate_labels, rbf_pred, average='weighted')*100)

print("SVM-POLY Accuracy Score -> ", accuracy_score(validate_labels, poly_pred)*100)
print("SVM-POLY F1 Score -> ", f1_score(validate_labels, poly_pred, average='weighted')*100)

Naive Bayes Accuracy Score ->  91.76755447941889
Naive Bayes F1 Score ->  87.82803825176705
SVM Linear Accuracy Score ->  91.52542372881356
SVM Linear F1 Score ->  87.7070432192676
SVM-RBF Accuracy Score ->  91.76755447941889
SVM-RBF F1 Score ->  87.82803825176705
SVM-POLY Accuracy Score ->  91.76755447941889
SVM-POLY F1 Score ->  87.82803825176705


The linear SVM scores slightly lower than the other models. The remaining three models produce exactly the same accuracy and F1, which suggests that with so little data they make the same predictions on the validation set.
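One way to read the identical scores is to compare them with a baseline that always predicts the negative class. The sketch below assumes the validation label counts printed earlier (379 negative, 34 positive) and rebuilds the labels synthetically; `zero_division=0` only silences the warning for the class that is never predicted.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Validation labels rebuilt from the value counts shown earlier: 379 zeros, 34 ones.
y_true = np.array([0] * 379 + [1] * 34)

# Constant baseline: predict the negative class for every triple.
y_baseline = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_baseline) * 100
baseline_f1_weighted = f1_score(
    y_true, y_baseline, average="weighted", zero_division=0
) * 100
# F1 of the positive class alone, which the weighted average hides.
baseline_f1_positive = f1_score(y_true, y_baseline, pos_label=1, zero_division=0) * 100

print(baseline_accuracy)      # about 91.77
print(baseline_f1_weighted)   # about 87.83
print(baseline_f1_positive)   # 0.0
```

These values coincide with the Naive Bayes, SVM-RBF and SVM-POLY scores above, which is consistent with those three models labelling every validation triple as negative. Under that reading, the weighted F1 is carried entirely by the negative class.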

#### Classifier evaluation

Scoring is informative, but let's test the models on new data and see the results.

In [17]:
manual_eval = pd.read_csv('data/triple_pairs.csv', names=['triple', 'label'])

In [18]:
me_triples = manual_eval['triple']

In [43]:
predict_triples_tfidf = tfidf_vect.transform(me_triples)

In [20]:
new_predictions_SVM = rbf_svm.predict(predict_triples_tfidf)

In [21]:
new_predictions_NB = naive.predict(predict_triples_tfidf)

In [22]:
manual_eval['nb'] = new_predictions_NB
manual_eval['svm'] = new_predictions_SVM

In [23]:
manual_eval

| | triple | label | nb | svm |
|---|---|---|---|---|
| 0 | age had horse | 0 | 0 | 0 |
| 1 | horse has age | 1 | 0 | 0 |
| 2 | maple have leave | 0 | 0 | 0 |
| 3 | maple have leaf | 1 | 0 | 0 |
| 4 | mountain are summit | 0 | 0 | 0 |
| 5 | mountain has summit | 1 | 0 | 0 |
| 6 | culture is proto-indo-europeans | 0 | 0 | 0 |
| 7 | culture is proto-indo-european | 1 | 0 | 0 |
| 8 | sword are thrusting | 0 | 0 | 0 |
| 9 | sword is thrusting | 1 | 0 | 0 |


Both models predict only the negative class. A classification report, computed here on the training set, may help explain why.

In [24]:
train_test = rbf_svm.predict(train_triple_tfidf)
print(classification_report(train_labels, train_test))

              precision    recall  f1-score   support

           0       0.93      1.00      0.96      1531
           1       1.00      0.03      0.07       119

    accuracy                           0.93      1650
   macro avg       0.97      0.52      0.51      1650
weighted avg       0.94      0.93      0.90      1650



As expected, the low number of positive triples hurts the recall of the models: even on the training data, the RBF SVM recovers only 3% of the positive triples.
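A common response to this kind of imbalance is to weight the classes inversely to their frequency. This is not something the notebook does; the sketch below only shows, as an assumption about a possible next step, which weights scikit-learn's `'balanced'` heuristic would derive from the training support in the report above (1531 negative, 119 positive).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Training labels rebuilt from the support column of the report above.
y_train = np.array([0] * 1531 + [1] * 119)

# 'balanced' uses n_samples / (n_classes * count_of_class).
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_train
)
class_weights = {label: float(w) for label, w in zip([0, 1], weights)}

print(class_weights)  # roughly {0: 0.54, 1: 6.93}
```

Passing `class_weight='balanced'` to `svm.SVC` applies the same weights during training; whether that improves recall on these triples would need to be checked on the validation set. `MultinomialNB` has no `class_weight` parameter.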

## Conclusion

It seems that the models learn only the negative class.

The task follows a high-yield extraction method that produces few positive examples.

The nature of the task is such that new data can be added, but the ratio of positive examples will remain around 10% of the total number of extracted triples.

## References
[Naive Bayes](https://en.wikipedia.org/wiki/Naive_Bayes_classifier)

[SVM](https://en.wikipedia.org/wiki/Support_vector_machine)

[Text classification with SVM](https://medium.com/@bedigunjit/simple-guide-to-text-classification-nlp-using-svm-and-naive-bayes-with-python-421db3a72d34)

[SVM multiclass classification](https://www.baeldung.com/cs/svm-multiclass-classification)
