# TourBERT traveling embeddings

Here we use the **TourBERT** sentiment analysis model to create embedding vectors for travel-related questions. TourBERT is a pre-trained NLP model for analyzing the sentiment of tourism-related text. This approach does not improve on the approaches we tried earlier. In fact, at 91.5% it scores below all of them (95.8% to 98.3%):

| Model | Score |
| --- | --- |
| Stemming | 97.8% |
| Lemmatization | 96.4% |
| N-grams | 96.3% |
| Stemming + Stop words | 98.3% |
| Custom word vectors combined with IDF | 98.0% |
| Custom word vectors combined with POS+NER | 97.4% |
| Pretrained word vectors | 95.8% |
| Embeddings from pretrained TourBERT | 91.5% |

In [3]:
import json
from typing import Collection

import numpy as np
import pandas as pd
import torch
from sklearn.neighbors import NearestNeighbors
from transformers import BertTokenizer, BertModel

N_NEIGHBOURS = 100

In [4]:
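def batch(iterable: Collection, n: int = 1) -> Collection:
    """
    Yields consecutive batches of at most n items from the input collection

    :param iterable: Collection
        input collection; must support len() and slicing
    :param n: int
        batch size (the last batch may be shorter)
    :return:
        generator over batches
    """
    length = len(iterable)
    for ndx in range(0, length, n):
        # Slicing clamps at the end, so the last batch is simply shorter
        yield iterable[ndx : ndx + n]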

In [6]:
df = pd.read_csv("../../../data/traveling_qna_dataset.csv", sep="\t")
df.drop(columns=df.columns[0], axis=1, inplace=True)
questions = np.unique(df.iloc[:, 0].to_numpy())

# tokenizer = BertTokenizer.from_pretrained("veroman/TourBERT")
# model = BertModel.from_pretrained("veroman/TourBERT")

# Transform and save input questions
# outputs = []
# for i, q in enumerate(batch(questions, 32)):
#     encoded_text = tokenizer(q.tolist(), return_tensors="pt", padding=True)
#     with torch.no_grad():
#         output = model(**encoded_text)[1].numpy().reshape(-1)
#     output = np.array_split(output, len(q))
#     outputs += output
#     print((i + 1) * 32)

# outputs = np.asarray(outputs)
# np.save('tourbert_emmbedings.npy', outputs)
outputs = np.load("tourbert_emmbedings.npy")

knn = NearestNeighbors(n_neighbors=N_NEIGHBOURS, metric="cosine").fit(outputs)

with open("../../../data/test_questions_json.json") as json_file:
    json_data = json.load(json_file)

test_questions = json_data["question"]
original = json_data["original"]

# Transform and save test questions
# encoded_text = tokenizer(test_questions, return_tensors="pt", padding=True)
# with torch.no_grad():
#     output = model(**encoded_text)[1].numpy().reshape(-1)
# output = np.array_split(output, len(test_questions))
# tq = np.asarray(output)
# np.save('tourbert_emmbedings_test.npy', tq)
tq = np.load("tourbert_emmbedings_test.npy")

_, indices = knn.kneighbors(tq)

indices_original = np.asarray([questions.tolist().index(o) for o in original])

rank = np.where(indices == indices_original[:, None])[1]
penalization = (indices_original.shape[0] - rank.shape[0]) * 2 * knn.n_neighbors
score = (rank.sum() + penalization) / indices_original.shape[0]

print(f"Score: {100 - score / (2 * N_NEIGHBOURS) * 100:.2f}%")


Score: 91.48%
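
The score above is rank-based: a test question whose original is the nearest neighbour contributes 100%, one found at 0-based rank `r` among `k` neighbours contributes `100 * (1 - r / (2 * k))`, and one whose original is missing from the `k` neighbours contributes 0%. The sketch below reproduces the same arithmetic on a small hand-made example so the numbers can be checked by hand. The function name `retrieval_score` and the toy arrays are illustrative assumptions, not part of the notebook.

```python
import numpy as np


def retrieval_score(indices: np.ndarray, targets: np.ndarray) -> float:
    """Rank-based score in percent, mirroring the computation above.

    Hypothetical helper: assumes each target occurs at most once per row,
    which holds for the output of a nearest-neighbour search.
    """
    n_queries, k = indices.shape
    # Column position (0-based rank) of the target in each row where it appears
    rank = np.where(indices == targets[:, None])[1]
    # Each query whose target is missing from its k neighbours costs 2 * k
    penalization = (n_queries - rank.shape[0]) * 2 * k
    mean_cost = (rank.sum() + penalization) / n_queries
    return 100 - mean_cost / (2 * k) * 100


# Toy data: 3 queries, k = 4 neighbours each
toy_indices = np.array([[5, 2, 9, 1], [7, 3, 0, 8], [4, 6, 2, 0]])
toy_targets = np.array([5, 0, 11])  # found at rank 0, found at rank 2, not found

toy_score = retrieval_score(toy_indices, toy_targets)
print(f"Score: {toy_score:.2f}%")  # Score: 58.33%
```

Per query, the three contributions are 100%, 75%, and 0%, and their mean is 58.33%. Under this metric even a hit at the last rank still earns just over 50%, so the reported percentage is not directly comparable to top-1 accuracy.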
