# Multilingual Search Engine

*__Authors:__ Tomas Ruan Rollan & Rian Hoorelbeke*
<br>
*__LinkedIn:__ https://www.linkedin.com/in/tomas-ruan/*
<br>
*__Email:__ tomruarol@gmail.com | rian.hoorelbeke@gmail.com*

### Imports

This notebook runs only on Linux and macOS: at the time of writing, `tensorflow_text` and `faiss` were not available for Windows.

In [1]:
# Library for data manipulation
import pandas as pd

# Libraries for deep learning 
import tensorflow_hub as hub
import tensorflow as tf
import tensorflow_text

# Libraries for NLP
from flair.embeddings import BertEmbeddings, DocumentPoolEmbeddings
from flair.data import Sentence

# Library for Abstract Base Classes
from abc import ABCMeta, abstractmethod

# Library for efficient similarity search and clustering of dense vectors (vectors with mostly non-zero values)
import faiss

# Library for progress bars
from tqdm import tqdm

### Data Load

Data must be downloaded from Kaggle competition: <br>
https://www.kaggle.com/c/quora-question-pairs/data <br>
The Quora dataset consists of question pairs, each labeled with whether the two questions have the same meaning.

In [2]:
# Read the training data into a pandas DataFrame
data = pd.read_csv('data/train.csv')
# Show the first pairs of questions
data.head()

| Unnamed: 0 | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
| 1 | 1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 |
| 2 | 2 | 5 | 6 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 |
| 3 | 3 | 7 | 8 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0 |
| 4 | 4 | 9 | 10 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0 |


In [3]:
# Remove rows with missing values; with inplace=True, dropna modifies `data` directly and returns None
data.dropna(inplace=True)

In [4]:
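# Abstract base class shared by the TensorFlow Hub encoders below
class TFEncoder(metaclass=ABCMeta):
    def __init__(self, model_path: str):
        self.model = hub.load(model_path)

    @abstractmethod
    def encode(self, text):
        """Return the embeddings of `text` as a NumPy array."""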

In [5]:
# Universal sentence encoder that works with multiple languages
# It can embed text from 16 languages into a shared semantic embedding space
class USE(TFEncoder):
    def __init__(self, model_path):
        super().__init__(model_path)
        
    def encode(self, text):
        return self.model(text).numpy()

In [6]:
# Universal sentence encoder trained on Question Answer pairs
class USEQA(TFEncoder):
    def __init__(self, model_path):
        super().__init__(model_path)
        
    def encode(self, text):
        return self.model.signatures['question_encoder'](tf.constant(text))['outputs'].numpy()

In [7]:
# BERT is an open-source language model introduced by Google. It helps computers
# understand ambiguous language by using the surrounding text to establish context
class BERT():
    def __init__(self, model_name, layers="-2", pooling_operation="mean"):
        self.embeddings = BertEmbeddings(model_name, layers=layers, pooling_operation=pooling_operation)
        self.document_embeddings = DocumentPoolEmbeddings([self.embeddings], fine_tune_mode='nonlinear')
        
    def encode(self, text):
        sentence = Sentence(text)
        self.document_embeddings.embed(sentence)
        return sentence.embedding.detach().numpy().reshape(1, -1)

In [8]:
# model_path = 'https://tfhub.dev/google/universal-sentence-encoder-qa/3'
# model_path = '../../models/universal-sentence-encoder-qa3/'

# https://arxiv.org/pdf/1803.11175.pdf
# model_path = '../../models/universal-sentence-encoder-large5/' #best for english

# Use the correct path to load the model for the USE encoder
model_path = "https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3"
# model_path = '../../models/universal-sentence-encoder-multilingual-large3/'

# encoder = BERT('bert-base-uncased')
encoder = USE(model_path)

INFO:absl:Using /var/folders/0s/skg4xy3d4_z6br1c9rlxqpg00000gn/T/tfhub_modules to cache modules.


In [9]:
# Encode the word 'hello' into a vector of size 1x512
encoder.encode(['hello']).shape

(1, 512)

In [10]:
# The vector representing 'hello' has 512 components;
# this number is used as the dimensionality of the indexer class
d = encoder.encode(['hello']).shape[-1]
d

512

#### Faiss Class
Indexer class that stores all embeddings and supports fast vector search.

In [11]:
class FAISS:
    # Create an exact (brute-force) L2 index for vectors with the given number of dimensions
    def __init__(self, dimensions: int):
        self.dimensions = dimensions
        self.index = faiss.IndexFlatL2(dimensions)
        self.vectors = {}  # maps position in the index -> (text, vector)
        self.counter = 0

    # Add one text and its embedding `v` (float32 array of shape (1, dimensions)) to the index
    def add(self, text: str, v):
        self.index.add(v)
        self.vectors[self.counter] = (text, v)
        self.counter += 1

    # Print the k indexed texts closest to the query embedding `v`, nearest first,
    # together with their distance (IndexFlatL2 reports squared L2 distances)
    def search(self, v, k: int = 10):
        distance, item_index = self.index.search(v, k)
        for dist, i in zip(distance[0], item_index[0]):
            # FAISS pads the result with -1 when fewer than k vectors are indexed
            if i == -1:
                break
            # Plain f-string formatting, so texts containing '%' cannot break the output
            print(f'{self.vectors[i][0]}, {dist:.2f}')
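
To make the behaviour of the flat index more concrete, here is a minimal, dependency-free sketch of exhaustive L2 search: every stored vector is compared with the query and the closest ones are returned. The function names and the toy 2-dimensional vectors are made up for illustration and are not part of the FAISS API; FAISS does the same kind of comparison far more efficiently.

```python
# Dependency-free sketch of exhaustive ("flat") L2 search.
# All names here are illustrative; none of them belong to FAISS.

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(stored, query, k=10):
    """Return up to k (distance, position) pairs, nearest first."""
    scored = [(squared_l2(vec, query), pos) for pos, vec in enumerate(stored)]
    scored.sort()  # smallest distance first
    return scored[:k]


# Toy 2-dimensional "embeddings" (made-up values, for illustration only)
stored = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
results = flat_l2_search(stored, query=[0.9, 0.1], k=2)
for dist, pos in results:
    print(f'position {pos}, {dist:.2f}')
```

The cost of this approach grows linearly with the number of stored vectors, which is acceptable for the few thousand questions indexed in this notebook.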

#### Vector Search Test
We index the words 'hello' and 'bye' and then search for 'hi' to compare their distances.

In [12]:
index = FAISS(d)

# index word
t1 = 'hello'
v1 = encoder.encode([t1])
index.add(t1, v1)

# index word
t1 = 'bye'
v1 = encoder.encode([t1])
index.add(t1, v1)

# search similar word
t1 = 'hi'
v1 = encoder.encode([t1])

# 'hi' and 'hello' are similar in meaning, so their distance is small
# 'hi' and 'bye' are not similar, so their distance is large
print('word,  distance')
index.search(v1)

word,  distance
hello, 0.07
bye, 0.83
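
The distances above can be related to cosine similarity under two assumptions that are worth checking before relying on the result: that the reported values are squared L2 distances (which is what `IndexFlatL2` returns), and that the embeddings have unit length (not verified in this notebook). For unit vectors, the squared distance equals 2 - 2·cos, so the cosine similarity is 1 - d²/2. The helper name below is hypothetical.

```python
# Sketch only: valid ONLY if both vectors have unit length (an assumption here)
# and the distance is a squared L2 distance.

def cosine_from_squared_l2(squared_distance):
    """Cosine similarity implied by a squared L2 distance between unit vectors."""
    return 1.0 - squared_distance / 2.0


# Distances copied from the output above
for word, dist in [('hello', 0.07), ('bye', 0.83)]:
    print(f'{word}: cosine similarity ~ {cosine_from_squared_l2(dist):.3f}')
```

Under those assumptions, 'hello' would have a cosine similarity of about 0.965 with 'hi', and 'bye' about 0.585.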


#### Reduce the size of dataset
Encoding the full Quora dataset takes a long time, so we work with a random sample of 1% of the data.

In [13]:
# Let's take a smaller amount of the dataset, here we take a random sample of 1% of the data
reduce_data = data.sample(frac=0.01, random_state=1)

# We use only the questions of the first column of paired questions
subset_to_ask = reduce_data.question1.values
# 4043 questions remain
len(subset_to_ask)

4043

#### Generate Embeddings and Index all questions
Encoding and indexing approximately 4000 questions takes about 3 to 4 minutes.

In [14]:
# Encode every question and add it to the index (tqdm shows a progress bar)
# Note: `index` still contains the test words 'hello' and 'bye' added above
for question in tqdm(subset_to_ask):
    embed = encoder.encode([question])
    index.add(question, embed)

100%|██████████| 4043/4043 [03:39<00:00, 18.46it/s]
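
The loop above calls the encoder once per question. Because `encode` accepts a list of strings, one possible variation is to encode the questions in batches, which might reduce per-call overhead (this was not measured here). The sketch below shows only the chunking step; the helper name `batched`, the batch size of 64, and the commented usage are assumptions, not part of the original notebook.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical usage with the objects defined in this notebook (untested):
# for batch in batched(list(subset_to_ask), 64):
#     embeddings = encoder.encode(batch)            # one encoder call per batch
#     for question, row in zip(batch, embeddings):
#         index.add(question, row.reshape(1, -1))   # keep the (1, d) shape that `add` receives above

# Small demonstration with placeholder strings
batches = list(batched(['q1', 'q2', 'q3', 'q4', 'q5'], batch_size=2))
print(batches)
```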


In [15]:
# Embed the question passed as an argument and print the k most similar
# questions in the index
def search(s, k=10):
    embed = encoder.encode([s])
    index.search(embed, k)

#### Search Examples
Now we run the same question in several languages and compare the results.
The `search` function prints the questions from the dataset that are closest to the query, each followed by its distance.
A smaller distance means greater similarity.

In [16]:
print('English')
search('What are your 10/10 movies?')

English
Which are the must watch movies?, 0.59
What are best Hollywood movies?, 0.61
What are the best Hollywood movies ever?, 0.63
What are some of the movies of Hollywood that you must watch?, 0.64
List of best Hollywood movies 2016?, 0.66
What movie can you watch all the time and never get tired of watching?, 0.67
Which is the best movie ever?, 0.69
Which are best Hollywood classic movies of all time?, 0.70
What are your top 3 movie genres?, 0.74
What are the 10 greatest horror movies of all time?, 0.75


In [17]:
print('Spanish')
search('¿Cuáles son tus películas 10/10?')

Spanish
What are the best Hollywood movies ever?, 0.63
What are best Hollywood movies?, 0.63
Which are the must watch movies?, 0.65
What are some of the movies of Hollywood that you must watch?, 0.65
List of best Hollywood movies 2016?, 0.69
Which is the best movie ever?, 0.70
Which are best Hollywood classic movies of all time?, 0.70
What movie can you watch all the time and never get tired of watching?, 0.70
What are your top 3 movie genres?, 0.75
What is the greatest movie ever?, 0.77


In [18]:
print('German')
search('Was sind deine 10/10 Filme?')

German
What movie can you watch all the time and never get tired of watching?, 0.66
What are best Hollywood movies?, 0.66
What are the best Hollywood movies ever?, 0.66
What are some of the movies of Hollywood that you must watch?, 0.67
Which is the best movie ever?, 0.69
Which are the must watch movies?, 0.71
Which are best Hollywood classic movies of all time?, 0.73
List of best Hollywood movies 2016?, 0.75
What are your top 3 movie genres?, 0.75
What is the greatest movie ever?, 0.75


In [19]:
print('Russian')
search('Какие у тебя фильмы 10/10?')

Russian
Which are the must watch movies?, 0.69
What are best Hollywood movies?, 0.70
List of best Hollywood movies 2016?, 0.71
What are the best Hollywood movies ever?, 0.73
What are some of the movies of Hollywood that you must watch?, 0.77
Which are best Hollywood classic movies of all time?, 0.77
What are your top 3 movie genres?, 0.77
What are the 10 greatest horror movies of all time?, 0.80
What movie can you watch all the time and never get tired of watching?, 0.82
Which is the best movie ever?, 0.83


In [20]:
print('Chinese')
search('你的10/10电影是什么？')

Chinese
What movie can you watch all the time and never get tired of watching?, 0.64
Which is the best movie ever?, 0.71
What are some of the movies of Hollywood that you must watch?, 0.72
What is the greatest movie ever?, 0.73
What are the best Hollywood movies ever?, 0.74
What are best Hollywood movies?, 0.75
Which are the must watch movies?, 0.81
Which are best Hollywood classic movies of all time?, 0.82
What are your top 3 movie genres?, 0.86
What is the best comedy movie ever?, 0.87


In [21]:
print('Japanese')
search('あなたの10/10の映画は何ですか？')

Japanese
Which is the best movie ever?, 0.60
What movie can you watch all the time and never get tired of watching?, 0.61
What is the greatest movie ever?, 0.64
What are some of the movies of Hollywood that you must watch?, 0.66
What are the best Hollywood movies ever?, 0.67
What are best Hollywood movies?, 0.69
Which are the must watch movies?, 0.74
Which are best Hollywood classic movies of all time?, 0.74
What are the 10 greatest horror movies of all time?, 0.81
List of best Hollywood movies 2016?, 0.81
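
One simple way to quantify how consistent the results are across languages is to count how many of the top 10 questions two queries share. The sketch below does this for the English and Spanish result lists, copied from the outputs above; the function name `shared_results` is hypothetical, and the count ignores ranking order.

```python
# Top-10 results copied from the English and Spanish outputs above
english = [
    'Which are the must watch movies?',
    'What are best Hollywood movies?',
    'What are the best Hollywood movies ever?',
    'What are some of the movies of Hollywood that you must watch?',
    'List of best Hollywood movies 2016?',
    'What movie can you watch all the time and never get tired of watching?',
    'Which is the best movie ever?',
    'Which are best Hollywood classic movies of all time?',
    'What are your top 3 movie genres?',
    'What are the 10 greatest horror movies of all time?',
]
spanish = [
    'What are the best Hollywood movies ever?',
    'What are best Hollywood movies?',
    'Which are the must watch movies?',
    'What are some of the movies of Hollywood that you must watch?',
    'List of best Hollywood movies 2016?',
    'Which is the best movie ever?',
    'Which are best Hollywood classic movies of all time?',
    'What movie can you watch all the time and never get tired of watching?',
    'What are your top 3 movie genres?',
    'What is the greatest movie ever?',
]


def shared_results(a, b):
    """Number of items that appear in both result lists, ignoring order."""
    return len(set(a) & set(b))


shared = shared_results(english, spanish)
print(f'{shared} of {len(english)} results are shared')
```

For these two lists, 9 of the 10 results are shared.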
