# Triple validation with SVM and Naive Bayes

### Introduction

This notebook continues an effort to extract RDF triples from raw text and focuses on the next step in the process: validation (classification) of the extracted triples. Information on the extraction method can be found [here](https://github.com/radev2711/DataScience/blob/main/Triples_from_text.ipynb).

#### Triples

A semantic triple is the core data entity in the Resource Description Framework (RDF) data model. It is a statement that represents semantic data as a set of three entities in the subject–predicate–object format. Every entity in the triple has its own unique URI. This format enables knowledge to be represented in a machine-readable way, allowing machines to query and reason about semantic data.

In the subject–predicate–object format, as in the example *turtles are reptiles*, the subject (sometimes called the head) and the object (sometimes called the tail) represent the two entities being related, and the predicate represents the nature of their relationship. This is similar to knowledge representation in graphs, where the subject and the object are nodes and the predicate is an edge (arc).
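The subject–predicate–object structure and its graph reading can be sketched in a few lines of Python. This is only an illustration: the `Triple` type, the `is_a` predicate name, and the use of plain strings instead of URIs are assumptions made for the example, not part of the extraction method.

```python
from collections import namedtuple

# A triple is a (subject, predicate, object) statement.
# Plain strings stand in for URIs here (illustrative assumption).
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

triples = [
    Triple("turtle", "is_a", "reptile"),
    Triple("reptile", "is_a", "animal"),
]

# Graph view: subjects and objects become nodes,
# predicates label the directed edges between them.
nodes = {t.subject for t in triples} | {t.object for t in triples}
edges = {(t.subject, t.object): t.predicate for t in triples}

print(sorted(nodes))
print(edges[("turtle", "reptile")])
```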

#### Naive Bayes

Naive Bayes classifiers are a family of linear "probabilistic classifiers" based on applying Bayes' theorem, which describes the probability of an event based on prior knowledge of conditions that might be related to the event. They make strong (naive) independence assumptions between the features and assign each instance to its most probable class, which minimizes the probability of misclassification when the model's assumptions hold.

Naive Bayes classifiers are highly scalable, requiring a number of parameters that is linear in the number of features in a learning problem. Their construction is simple: models assign class labels to instances, represented as vectors of feature values, where the class labels are drawn from some finite set. All naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable.

In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood.

Despite their naive design and apparently oversimplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. An advantage of this approach is that it only requires a small amount of training data to estimate the parameters necessary for classification.
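To make the independence assumption concrete, here is a minimal sketch of how a naive Bayes posterior is computed. All probabilities below are invented for illustration (they are not estimated from the triple data), and the helper name `posterior` is hypothetical.

```python
from math import prod

# Hypothetical class priors: most triples are negative (class 0).
prior = {0: 0.9, 1: 0.1}

# Hypothetical per-class word likelihoods P(word | class).
likelihood = {
    0: {"has": 0.2, "leaf": 0.1},
    1: {"has": 0.6, "leaf": 0.5},
}

def posterior(features):
    """Bayes' theorem with the naive assumption that features are
    independent given the class: P(c | x) is proportional to
    P(c) * product of P(x_i | c)."""
    joint = {c: prior[c] * prod(likelihood[c][f] for f in features) for c in prior}
    total = sum(joint.values())
    return {c: joint[c] / total for c in joint}

post = posterior(["has", "leaf"])
predicted = max(post, key=post.get)  # class with the highest posterior
print(post, predicted)
```

With these made-up numbers the evidence from the two words outweighs the strong prior for the negative class, so the positive class gets the higher posterior.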

#### Support Vector Machines

Support vector machines (SVMs) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. SVMs are among the most robust prediction methods, being based on the statistical learning framework of Vapnik–Chervonenkis (VC) theory. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. The SVM maps training examples to points in space so as to maximise the width of the gap between the two categories. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.

In addition to performing linear classification, SVMs can efficiently perform non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space. Common kernels include the Gaussian radial basis function and the polynomial kernel.

The polynomial kernel is a function that represents the similarity of vectors (training samples) in a feature space over polynomials of the original variables, allowing learning of non-linear models. This kernel looks not only at the given features of input samples to determine their similarity, but also combinations of these (interaction features). The feature space of a polynomial kernel is equivalent to that of polynomial regression, but without the combinatorial blowup in the number of parameters to be learned. When the input features are binary-valued (booleans), then the features correspond to logical conjunctions of input features.
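The claim that the polynomial kernel works in a space of interaction features without building them can be checked by hand. The sketch below assumes the simplest homogeneous degree-2 form, `(x . y) ** 2`, on 2-D inputs, with no scaling or constant term; the function names are illustrative.

```python
from math import sqrt, isclose

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def poly_kernel(x, y, degree=2):
    # Homogeneous polynomial kernel: (x . y) ** degree
    return dot(x, y) ** degree

def phi(x):
    # Explicit degree-2 feature map for 2-D input:
    # squares plus the interaction term x1 * x2.
    x1, x2 = x
    return (x1 * x1, sqrt(2) * x1 * x2, x2 * x2)

x, y = (1.0, 2.0), (3.0, 0.5)

k_implicit = poly_kernel(x, y)      # computed in the original 2-D space
k_explicit = dot(phi(x), phi(y))    # computed in the expanded 3-D space

print(k_implicit, k_explicit)
```

Both routes give the same similarity value, but the kernel never materialises the expanded feature vector.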

The radial basis function (RBF) kernel computes the similarity between two points, that is, how close they are to each other. Its value decreases with distance and ranges between zero and one: it equals one when the two points are identical and approaches zero as they move apart. The RBF kernel is popular because of its similarity to the k-nearest neighbors (k-NN) algorithm. It has the advantages of k-NN while overcoming its space complexity problem, because an RBF kernel SVM only needs to keep the support vectors found during training rather than the entire dataset.
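A minimal sketch of the Gaussian form of the RBF kernel, `exp(-gamma * ||x - y||^2)`, shows how the value behaves with distance. The points and the `gamma` value are arbitrary choices for illustration.

```python
from math import exp

def rbf_kernel(x, y, gamma=0.5):
    # Squared Euclidean distance between the two points
    sq_dist = sum((p - q) ** 2 for p, q in zip(x, y))
    return exp(-gamma * sq_dist)

same = rbf_kernel((1, 2), (1, 2))   # identical points
near = rbf_kernel((1, 2), (1, 3))   # distance 1
far = rbf_kernel((1, 2), (4, 6))    # distance 5

print(same, near, far)
```

Identical points give exactly 1.0, and the value shrinks towards zero as the points move apart.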


### Triple Validation

Triple classification or validation aims to judge whether a given triple is correct or not, which is a binary classification task. 

#### Data

The data consists of a CSV file containing the output of a method that extracts triples from raw text. The labels were added to the data manually.

There is no true test set, but there is a small set for human evaluation containing 10 triples arranged as 5 negative/positive pairs. These pairs were made by removing negative triples from the initial dataset and editing them to become positive. The triples were removed to ensure that the models have not been trained or validated on this data.

#### Training classifiers


In [2]:
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import model_selection, naive_bayes, svm
from sklearn.metrics import accuracy_score, f1_score, classification_report #precision_score, recall_score

In [3]:
triple_data = pd.read_csv('data/train_triples.csv', names=['triple', 'label'])

As a first step, let's look at the data.

In [4]:
triple_data

| | triple | label |
|---|---|---|
| 0 | luck can plot | 0 |
| 1 | use is period | 0 |
| 2 | sword is swordsmanship | 0 |
| 3 | system are size | 0 |
| 4 | spear is thrower | 0 |
| ... | ... | ... |
| 2058 | light is interaction | 0 |
| 2059 | size be mass | 0 |
| 2060 | month has group | 0 |
| 2061 | problem be number | 0 |


Label 0 marks a negative example and label 1 marks a positive example.

The extraction method has a high yield, but produces few positive examples.

In [5]:
triple_data['label'].value_counts()

0    1910
1     153
Name: label, dtype: int64

As the data does not come with a train-test split, the splitter from scikit-learn can be used. Because multiple models are compared on the same held-out data, that split serves as a validation set rather than a true test set.

The split is 80/20 to maximize the number of positive examples available for training.

In [28]:
train_triples, validate_triples, train_labels, validate_labels = model_selection.train_test_split(triple_data['triple'], triple_data['label'], test_size=0.20)

Let's make sure that the validation labels contain the positive class.

In [29]:
validate_labels.value_counts()

0    380
1     33
Name: label, dtype: int64
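The split above is random, so the number of positive examples that land in the validation set varies between runs. One possible alternative, not used in this notebook, is a stratified split, which keeps the class ratio the same in both parts. The sketch below uses synthetic placeholder data with a 10% positive rate; the variable names and the fixed `random_state` are assumptions for the example.

```python
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 "triples", 10 of them positive.
triples = [f"triple {i}" for i in range(100)]
labels = [1 if i < 10 else 0 for i in range(100)]

# stratify=labels keeps the positive/negative ratio equal in both parts;
# random_state makes the split reproducible.
X_train, X_val, y_train, y_val = train_test_split(
    triples, labels, test_size=0.20, stratify=labels, random_state=0
)

print(sum(y_train), sum(y_val))  # positives in training and validation
```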

The triples column needs to be converted into vectors. Scikit-learn offers a count vectorizer and a TF-IDF vectorizer. Let's test both and see which one is more helpful with the triples.

In [30]:
tfidf_vect = TfidfVectorizer()
tfidf_vect.fit(triple_data['triple'])
train_triple_tfidf = tfidf_vect.transform(train_triples)
validate_triple_tfidf = tfidf_vect.transform(validate_triples)

In [31]:
count_vect = CountVectorizer()
count_vect.fit(triple_data['triple'])
train_triple_count = count_vect.transform(train_triples)
validate_triple_count = count_vect.transform(validate_triples)
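Both vectorizers above are fitted on the full dataset, so their vocabulary and weights also reflect the validation triples. A stricter setup, sketched here as an assumption rather than as the notebook's method, fits the vectorizer on the training texts only by chaining it with the classifier in a scikit-learn pipeline. The tiny training set and its labels are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training and validation texts for illustration only.
train_texts = ["horse has age", "age had horse", "maple have leaf", "maple have leave"]
train_y = [1, 0, 1, 0]
val_texts = ["mountain has summit"]

# fit() learns the vocabulary and IDF weights from the training texts only;
# predict() reuses them and ignores words never seen in training.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_y)

vocab = model.named_steps["tfidfvectorizer"].vocabulary_
pred = model.predict(val_texts)
print(sorted(vocab), pred)
```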

In [32]:
print(train_triple_tfidf[:10])

  (0, 1464)	0.6970607756375621
  (0, 1345)	0.6614612313135307
  (0, 623)	0.2767224503663339
  (1, 1099)	0.6916152959494455
  (1, 710)	0.20817436157587133
  (1, 648)	0.6916152959494455
  (2, 1010)	0.9595970239998503
  (2, 623)	0.281377951393905
  (3, 1203)	0.6916152959494455
  (3, 795)	0.6916152959494455
  (3, 710)	0.20817436157587133
  (4, 1294)	0.6251106533852683
  (4, 887)	0.6890891308601802
  (4, 623)	0.36659629124502086
  (5, 991)	0.6208267302626957
  (5, 827)	0.699496567998095
  (5, 116)	0.3539473440361099
  (6, 1469)	0.6873813688372215
  (6, 1276)	0.6216279272720678
  (6, 173)	0.37561359641378883
  (7, 1489)	0.7531696638101186
  (7, 710)	0.21512417643155518
  (7, 382)	0.6216566948330872
  (8, 1123)	0.8190873899151332
  (8, 980)	0.5237960861091602
  (8, 710)	0.2339519349326691
  (9, 1518)	0.7048112119987632
  (9, 1517)	0.6234670472114084
  (9, 618)	0.338422807272681


In [33]:
print(train_triple_count[:10])

  (0, 623)	1
  (0, 1345)	1
  (0, 1464)	1
  (1, 648)	1
  (1, 710)	1
  (1, 1099)	1
  (2, 623)	1
  (2, 1010)	2
  (3, 710)	1
  (3, 795)	1
  (3, 1203)	1
  (4, 623)	1
  (4, 887)	1
  (4, 1294)	1
  (5, 116)	1
  (5, 827)	1
  (5, 991)	1
  (6, 173)	1
  (6, 1276)	1
  (6, 1469)	1
  (7, 382)	1
  (7, 710)	1
  (7, 1489)	1
  (8, 710)	1
  (8, 980)	1
  (8, 1123)	1
  (9, 618)	1
  (9, 1517)	1
  (9, 1518)	1


It looks like the count vectorizer returns only raw term counts (almost always 1 for such short texts), while TF-IDF returns weights that reflect how informative each term is.

With the triples represented as vectors, it is time to train some models. The training set is fitted with four classifiers: naive Bayes and SVMs with three types of kernel (linear, RBF, and polynomial), to look for the best performer or differences in performance.
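The difference between the two vectorizers is easier to see on a toy input. The three short texts below are illustrative (the last one is invented so that a word repeats); everything else uses scikit-learn defaults.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["sword is thrusting", "horse has age", "maple is maple"]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs).toarray()

tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs).toarray()

# Raw counts: every word scores 1 unless it repeats within a text.
print(counts[2, count_vec.vocabulary_["maple"]])

# TF-IDF: "is" occurs in two texts, so it is weighted lower than
# "sword", which occurs in only one.
print(tfidf[0, tfidf_vec.vocabulary_["is"]], tfidf[0, tfidf_vec.vocabulary_["sword"]])
```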

In [34]:
# Train naive Bayes
naive = naive_bayes.MultinomialNB()
naive.fit(train_triple_tfidf, train_labels)
predictions_NB = naive.predict(validate_triple_tfidf)

In [35]:
# Train linear SVM (degree and gamma are ignored by the linear kernel)
svm_lin = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
svm_lin.fit(train_triple_tfidf, train_labels)
predictions_SVM = svm_lin.predict(validate_triple_tfidf)

In [36]:
# Train Gaussian (RBF) SVM
rbf_svm = svm.SVC(kernel='rbf', gamma=0.5, C=1.0)
rbf_svm.fit(train_triple_tfidf, train_labels)
rbf_pred = rbf_svm.predict(validate_triple_tfidf)

In [37]:
# Train polynomial SVM
poly_svm = svm.SVC(kernel='poly', degree=3, C=1)
poly_svm.fit(train_triple_tfidf, train_labels)
poly_pred = poly_svm.predict(validate_triple_tfidf)

#### Classifier scoring

Accuracy and F1 are among the most commonly used metrics in supervised machine learning.

In [38]:
print("Naive Bayes Accuracy Score -> ", accuracy_score(validate_labels, predictions_NB )*100)
print("Naive Bayes F1 Score -> ", f1_score(validate_labels, predictions_NB, average='weighted')*100)

print("SVM Linear Accuracy Score -> ", accuracy_score(validate_labels, predictions_SVM)*100)
print("SVM Linear F1 Score -> ", f1_score(validate_labels, predictions_SVM, average='weighted')*100)

print("SVM-RBF Accuracy Score -> ", accuracy_score(validate_labels, rbf_pred)*100)
print("SVM-RBF F1 Score -> ", f1_score(validate_labels, rbf_pred, average='weighted')*100)

print("SVM-POLY Accuracy Score -> ", accuracy_score(validate_labels, poly_pred)*100)
print("SVM-POLY F1 Score -> ", f1_score(validate_labels, poly_pred, average='weighted')*100)

Naive Bayes Accuracy Score ->  92.00968523002422
Naive Bayes F1 Score ->  88.1807828181821
SVM Linear Accuracy Score ->  91.76755447941889
SVM Linear F1 Score ->  88.05977450045248
SVM-RBF Accuracy Score ->  92.00968523002422
SVM-RBF F1 Score ->  88.1807828181821
SVM-POLY Accuracy Score ->  92.00968523002422
SVM-POLY F1 Score ->  88.1807828181821


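One of the SVM models (the linear one) scores slightly differently from the others, which produce exactly the same scores. Their shared accuracy of 92.01% equals 380/413, the proportion of negative examples in the validation set, which indicates that these three models predict the negative class for every validation triple.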
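The identical scores can be checked against a majority-class baseline. The sketch below rebuilds the validation label counts reported earlier (380 negative, 33 positive) and scores a hypothetical predictor that always answers 0; the variable names are illustrative.

```python
from sklearn.metrics import accuracy_score, f1_score

# Validation label distribution taken from the value counts above.
y_true = [0] * 380 + [1] * 33
# Hypothetical baseline that always predicts the negative class.
y_pred = [0] * 413

acc = accuracy_score(y_true, y_pred)
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
positive_f1 = f1_score(y_true, y_pred, pos_label=1, zero_division=0)

print(acc * 100, weighted_f1 * 100, positive_f1)
```

The baseline reproduces the accuracy and weighted F1 reported for naive Bayes, SVM-RBF and SVM-POLY, while its F1 for the positive class alone is zero. The weighted average is dominated by the negative class, so it hides the fact that no positive triple is recognised.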

#### Classifier evaluation

Scoring is informative and helps with selecting the best-performing model, but let's test the models with new data and look at the results.

In [39]:
manual_eval = pd.read_csv('data/triple_pairs.csv', names=['triple', 'label'])

In [40]:
me_triples = manual_eval['triple']

In [41]:
predict_triples_tfidf = tfidf_vect.transform(me_triples)

As the models perform almost equally, let's select the naive Bayes model and one SVM model and visualize their predictions.

In [47]:
new_predictions_svm = poly_svm.predict(predict_triples_tfidf)

In [43]:
new_predictions_nb = naive.predict(predict_triples_tfidf)

In [48]:
manual_eval['nb'] = new_predictions_nb
manual_eval['svm'] = new_predictions_svm

In [49]:
manual_eval

| | triple | label | nb | svm |
|---|---|---|---|---|
| 0 | age had horse | 0 | 0 | 0 |
| 1 | horse has age | 1 | 0 | 0 |
| 2 | maple have leave | 0 | 0 | 0 |
| 3 | maple have leaf | 1 | 0 | 0 |
| 4 | mountain are summit | 0 | 0 | 0 |
| 5 | mountain has summit | 1 | 0 | 0 |
| 6 | culture is proto-indo-europeans | 0 | 0 | 0 |
| 7 | culture is proto-indo-european | 1 | 0 | 0 |
| 8 | sword are thrusting | 0 | 0 | 0 |
| 9 | sword is thrusting | 1 | 0 | 0 |


The models predict only the negative class. Maybe a classification report on the training set will help with explainability.

In [46]:
train_test = poly_svm.predict(train_triple_tfidf)
print(classification_report(train_labels, train_test, zero_division=1))

              precision    recall  f1-score   support

           0       0.98      1.00      0.99      1530
           1       1.00      0.70      0.82       120

    accuracy                           0.98      1650
   macro avg       0.99      0.85      0.91      1650
weighted avg       0.98      0.98      0.98      1650



As expected, the low number of positive triples affects the recall of the models: even on the training data, the polynomial SVM recovers only 70% of the positive triples.

## Conclusion

The results show that the models learn only the negative class. The triple validation comes after a highly productive extraction method that results in a low number of positive examples. The nature of the task is such that new data can be added, but the ratio of positive examples will remain around 10% of the total number of extracted triples.

## References
[Naive Bayes](https://en.wikipedia.org/wiki/Naive_Bayes_classifier)

[SVM](https://en.wikipedia.org/wiki/Support_vector_machine)

[Text classification with SVM](https://medium.com/@bedigunjit/simple-guide-to-text-classification-nlp-using-svm-and-naive-bayes-with-python-421db3a72d34)

[SVM multiclass classification](https://www.baeldung.com/cs/svm-multiclass-classification)
