# Finding the correct action for a text query (command)

The problem is to find an action that matches a command given to the model. How do we find the right answer? Take Alexa as an example: it probably relies on a neural-network model to produce a coherent response. That works with large neural networks running on cloud infrastructure, but it is harder to achieve on a simpler architecture.



## Method 1: Incremental database learning

This method uses a command database to find the appropriate answer. The user gives a command string to the model, and the model answers. If the answer is not what the user wanted, the user associates an action with the command and the database is updated.

![](commands-doc/IDL-flow1.svg)
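
The flow above can be sketched as a small loop. This is only a minimal sketch under assumptions: the `CommandDatabase` class, the `handle` function, and the action name `tv_on` are hypothetical and not part of the notebook, and the lookup is an exact match on the normalized command string.

```python
from typing import Dict, Optional


class CommandDatabase:
    """Toy command store (hypothetical): normalized command string -> action name."""

    def __init__(self) -> None:
        self._actions: Dict[str, str] = {}

    @staticmethod
    def _normalize(command: str) -> str:
        # Lowercase and collapse whitespace so trivial variations match.
        return " ".join(command.lower().split())

    def lookup(self, command: str) -> Optional[str]:
        return self._actions.get(self._normalize(command))

    def learn(self, command: str, action: str) -> None:
        self._actions[self._normalize(command)] = action


def handle(db: CommandDatabase, command: str, correction: Optional[str] = None) -> Optional[str]:
    """Answer a command; if the user supplies a different action, store it and use it."""
    action = db.lookup(command)
    if correction is not None and correction != action:
        db.learn(command, correction)
        return correction
    return action


db = CommandDatabase()
first = handle(db, "Allume la télévision")                       # nothing known yet
second = handle(db, "Allume la télévision", correction="tv_on")  # the user teaches the action
third = handle(db, "allume  la télévision")                      # the database now answers
print(first, second, third)
```

Every correction makes the database a little more complete, which is the "incremental" part of the method.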

### Query preprocessing

The goal here is to strip unnecessary words (such as articles) from the query, after removing punctuation and lowercasing it.

In [1]:
from typing import Iterable, List

def remove_punctuation_and_cap(query: str) -> str:
    # Drop commas and periods, then lowercase the query.
    return query.replace(",", "").replace(".", "").lower()

def remove_stopwords(query: str, words: Iterable[str]) -> List[str]:
    # Convert to a set so each membership test is O(1), whatever container was passed in.
    stopwords = set(words)
    return [word for word in query.split() if word not in stopwords]

def clean_query(query: str, words: Iterable[str]) -> List[str]:
    # Normalize the query, then filter out the stopwords.
    return remove_stopwords(remove_punctuation_and_cap(query), words)

print(clean_query("le elijah, éteins la télévision le le", ["elijah", "la", "le", "cette"]))

['éteins', 'télévision']
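
The cell above only removes commas and periods, so a trailing `?` or `!` would survive as part of a token. A possible variant, shown here as a sketch and not as the notebook's method, strips every punctuation mark with `str.translate`. The names `PUNCTUATION`, `normalize`, and `tokens` are hypothetical, and the extra French marks in the list are an assumption.

```python
import string

# ASCII punctuation plus a few marks common in French text (assumed list).
PUNCTUATION = string.punctuation + "«»…"
# Map every punctuation character to a space so that words stay separated.
_STRIP_TABLE = str.maketrans({ch: " " for ch in PUNCTUATION})


def normalize(query: str) -> str:
    """Replace punctuation with spaces and lowercase the query."""
    return query.translate(_STRIP_TABLE).lower()


def tokens(query: str, stopwords) -> list:
    """Split the normalized query and drop the stopwords."""
    return [w for w in normalize(query).split() if w not in stopwords]


print(tokens("Elijah, quelle est la météo de demain ?", {"elijah", "la", "le"}))
```

Replacing punctuation with a space also splits elided forms such as `J'allume` into `j` and `allume`; whether that is desirable depends on the stopword list.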


## How to model the data

### Naive representation

The data could be modeled as a list of words extracted from a query, paired with a response string. The word lists would be hashed for storage. The hash should have low dispersion, so that queries sharing the same semantics end up close to each other.

![](commands-doc/data-storage.svg)

### Vectorizing the words

Another solution would be to vectorize the words first. This would let us classify words, which would be quite useful: each class would represent an answer that the model can give to the user. With this classification we could say "this vector has an xx% chance of matching the question". The database entry could then be selected by distance to the barycenter of each topic.

![](commands-doc/data-storage2.svg)
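
Selecting by distance to the barycenter can be illustrated with a nearest-centroid sketch. Everything here is an assumption made for illustration: the two-dimensional sample vectors, the class names `tv_on` and `tv_off`, and the inverse-distance score, which is a rough stand-in for the "xx% chance" and not a calibrated probability.

```python
import numpy as np


def barycenters(samples):
    """Mean vector of each class. `samples` maps a class name to equal-length vectors."""
    return {label: np.mean(np.asarray(vectors, dtype=float), axis=0)
            for label, vectors in samples.items()}


def classify(vector, centers):
    """Return the nearest class and inverse-distance scores that sum to 1."""
    labels = list(centers)
    point = np.asarray(vector, dtype=float)
    dists = np.array([np.linalg.norm(point - centers[label]) for label in labels])
    weights = 1.0 / (dists + 1e-9)  # small constant avoids division by zero
    scores = weights / weights.sum()
    return labels[int(np.argmin(dists))], dict(zip(labels, scores))


# Hypothetical training vectors for two answers.
samples = {
    "tv_on": [[0.9, 0.1], [1.0, 0.2]],
    "tv_off": [[-0.8, 0.1], [-1.0, 0.3]],
}
best, scores = classify([0.7, 0.0], barycenters(samples))
print(best, scores)
```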

## Searching

### Naive search

The simplest approach is to compare query hashes. This is easy to implement but works poorly when commands vary a lot. This search method is therefore not recommended: the user would have to register many aliases for each command, whereas the model could guess what the user wants.
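
The alias problem is easy to reproduce. The sketch below assumes an order-insensitive key built with SHA-256 from `hashlib`. The `query_key` helper, the `store` dictionary, and the action name `tv_on` are hypothetical.

```python
import hashlib


def query_key(words) -> str:
    """Order-insensitive key: hash of the sorted, de-duplicated word list."""
    canonical = " ".join(sorted(set(words)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# One stored command, keyed by the hash of its cleaned word list.
store = {query_key(["allume", "télévision"]): "tv_on"}

same_words = store.get(query_key(["télévision", "allume"]))  # same words, other order
alias = store.get(query_key(["allume", "télé"]))             # an alias is a miss
print(same_words, alias)
```

An exact hash has no notion of closeness, so `télé` and `télévision` produce unrelated keys and each alias would need its own entry.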

### Vector-based search

The challenge here is to assign a value to each word in order to build a vector. This vector is then reduced to a scalar value with a barycenter computation.

#### Naive method

We keep each word in memory and assign it a random weight. When a vector is computed, the weights are looked up in this table. This keeps the values consistent, but there is no guarantee that they will be coherent or efficient, especially for new words.

##### Implementation

The values must first be initialized for the known words below.

In [2]:
import random
import numpy as np

barycenter_list = list()
word_map = dict() # maps each word (str) to its weight (float)

known_words = set(["télévision", "allumer", "éteindre"])

max_range = 200

for word in known_words:
    word_map[word] = (random.randrange(-max_range, max_range))/max_range

##### Computation

In [3]:
def valuation(word : str) -> float:
    if word in word_map:
        return word_map[word]
    word_map[word] = (random.randrange(-max_range, max_range))/max_range
    return word_map[word]

print(valuation("télévision"))
print(valuation("Poupipou"))

-0.31
0.915


For each word in the cleaned sentence, a valuation is computed and stored in the vector.

In [4]:
def vectorize(sentence: list) -> np.ndarray:
    v = np.zeros((1, len(sentence)))
    
    for i, word in enumerate(sentence):
        v[0][i] = valuation(word)
    
    return v

In [5]:
%%script false --no-raise-error
# ^ Prevents execution
sentence = "Elijah, allume la télévision"
stoplist = ["elijah", "le", "la"]
words = clean_query(sentence, stoplist)


v1 = vectorize(clean_query("Elijah, allume la télévision", stoplist))
v2 = vectorize(clean_query("Elijah, éteins la télévision", stoplist))
v3 = vectorize(clean_query("Elijah, allume la télé", stoplist))
v4 = vectorize(clean_query("Elijah, éteins la télé", stoplist))

print(np.linalg.norm(v1 - v2))
print(np.linalg.norm(v1 - v3))
print(np.linalg.norm(v1 - v4))
print(np.linalg.norm(v2 - v3))
print(np.linalg.norm(v2 - v4))
print(np.linalg.norm(v3 - v4))

#### Using new rules

As expected, since the values are chosen randomly, similar words cannot be linked together and end up with very different values.
To link new words to known ones, a distance needs to be computed.
There are three types of words:
- **New words**, which have nothing to do with the others
    - These words need a random value.
- **Abbreviated words**
    - These words must have the same value as the word they stand for.
- **Alternative words**
    - These words must not change the barycenter of the class they belong to.
    - To simplify things, alternative words follow the same rule as abbreviated words.
    
    
##### How to find synonyms?

Use a synonym structure (a hash map) together with a set of base words.

![](commands-doc/synonyms.svg)

This structure should be filled in from an API (synonymes.net?).
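
Whatever the source of the synonyms, the structure itself can be built from a mapping of base words to their synonyms. The sketch below makes no API call. `build_synonym_map` and `resolve` are hypothetical helpers, and the input `groups` simply mirrors the example words used in this notebook.

```python
from typing import Dict, Iterable, List, Optional, Set


def build_synonym_map(groups: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Build a two-way lookup: base word -> synonyms and synonym -> base words."""
    table: Dict[str, List[str]] = {}
    for base, synonyms in groups.items():
        entry = table.setdefault(base, [])
        for syn in synonyms:
            if syn not in entry:
                entry.append(syn)
            back = table.setdefault(syn, [])
            if base not in back:
                back.append(base)
    return table


def resolve(word: str, table: Dict[str, List[str]], bases: Set[str]) -> Optional[str]:
    """Return the base word that `word` stands for, or None if it is unknown."""
    if word in bases:
        return word
    for candidate in table.get(word, []):
        if candidate in bases:
            return candidate
    return None


groups = {"allumer": [], "éteindre": [], "télévision": ["télé", "écran"]}
table = build_synonym_map(groups)
print(table)
print(resolve("écran", table, set(groups)))
```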




In [6]:
base_words = set(["allumer", "éteindre", "télévision"])

synonyms_dict = {
    "allumer": [],
    "éteindre": [],
    "télévision": ["télé", "écran"],
    "télé": ["télévision"],
    "écran" : ["télévision"]
}

The next cell checks whether a word is an abbreviation or a synonym of another word. An abbreviation is essentially a synonym, so the first function is in principle redundant, but it is kept here (and is still called by `valuation`).

In [7]:
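from typing import Optional

def abreviate(word: str) -> Optional[str]:
    # Return a known word that contains `word` as a substring, if any.
    for ws in set(word_map.keys()):
        if word in ws:
            return ws

    return None

def synonym(word: str) -> Optional[str]:
    # A base word stands for itself.
    if word in base_words:
        return word

    # Otherwise use the first registered synonym, if there is one.
    candidates = synonyms_dict.get(word)
    if candidates:
        return candidates[0]

    # Find possible synonyms here (e.g. by querying a synonym API)

    return None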

New valuation calculation

In [8]:
def valuation(word: str) -> float:
    if word in word_map: # Word exists
        return word_map[word]
    
    # New word
    
    abrev = abreviate(word)
    syn = synonym(word)
    
    if abrev: # The word is an abbreviation of a known word
        word_map[word] = word_map[abrev]
        return word_map[abrev]
        
    elif syn: # The word is a synonym of a known word
        word_map[word] = word_map[syn]
        return word_map[syn]
        
    # The word is completely new.
    # Note: unlike the initial weights, this value is not divided by max_range,
    # so it lies in [-max_range, max_range); the distances printed below reflect that.
    word_map[word] = random.randrange(-max_range, max_range)
    return word_map[word]

print(valuation("télévision"))
print(valuation("télé"))
print(valuation("écran"))

-0.31
-0.31
-0.31


In [9]:
sentence = "Elijah, allume la télévision"
stoplist = ["elijah", "le", "la"]
words = clean_query(sentence, stoplist)


v1 = vectorize(clean_query("Elijah, allume la télévision", stoplist))
v2 = vectorize(clean_query("Elijah, éteins la télévision", stoplist))
v3 = vectorize(clean_query("Elijah, allume la télé", stoplist))
v4 = vectorize(clean_query("Elijah, éteins la télé", stoplist))

print(np.linalg.norm(v1 - v2))
print(np.linalg.norm(v1 - v3))
print(np.linalg.norm(v1 - v4))
print(np.linalg.norm(v2 - v3))
print(np.linalg.norm(v2 - v4))
print(np.linalg.norm(v3 - v4))

181.815
0.0
181.815
181.815
0.0
181.815


##### Binding an answer to a vector

In [10]:
answers = dict()

def bind_vector(sentence_vector: np.ndarray, answers: dict, ans: str):
    # Bind the answer to the Euclidean norm of the sentence vector, rounded to 2 decimals.
    answers[ans] = np.round(np.linalg.norm(sentence_vector), 2)
    
def bind_question(question: str, answers : dict, ans : str):
    vec = vectorize(clean_query(question, stoplist))
    bind_vector(vec, answers, ans)

bind_question("Allume la télévision", answers, "Ok. J'allume la télévision.")
bind_question("Éteins la télévision", answers, "Ok. J'éteins la télévision.")
bind_question("Quelle est la météo de demain ?", answers, "Voici la météo de demain : ")

print(answers)

{"Ok. J'allume la télévision.": 0.87, "Ok. J'éteins la télévision.": 181.0, 'Voici la météo de demain : ': 193.28}


##### Finding the best-fitting answer

We use a priority queue to sort the answers by their distance to the query vector.

In [11]:
class pqueue:
    
    def __init__(self):
        self.data = list() # (item, priority)
    
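    def add(self, item, priority: float):
        # Insert before the first entry with a strictly greater priority,
        # so the list stays sorted in ascending order of priority.
        for i in range(len(self.data)):
            if self.data[i][1] > priority:
                self.data.insert(i, (item, priority))
                return
                
        # No greater priority found (or empty queue): append at the end.
        self.data.append((item, priority))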
        
    def __str__(self):
        s = ""
        
        if len(self.data) == 0:
            return "Empty queue"
        
        for item in self.data:
            s += str(item) + " "
        return s
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, item):
        return self.data[item]
        

In [12]:
%%script false --no-raise-error
# pqueue test
q = pqueue()
q.add("hello", 2)
q.add("hi", 4)
q.add("bonjour", -1)


print(q)

del q
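
The same ordering can also be obtained with the standard library's `heapq` module, shown here as an alternative sketch and not as a replacement for the class above. The `ranked` helper is hypothetical. The insertion index is used as a tie-breaker so that equal priorities keep their insertion order and the items themselves are never compared.

```python
import heapq


def ranked(pairs):
    """Yield (item, priority) pairs from the lowest to the highest priority value."""
    heap = []
    for order, (item, priority) in enumerate(pairs):
        # `order` breaks ties between equal priorities.
        heapq.heappush(heap, (priority, order, item))
    while heap:
        priority, _, item = heapq.heappop(heap)
        yield item, priority


result_order = list(ranked([("hello", 2), ("hi", 4), ("bonjour", -1)]))
print(result_order)
# → [('bonjour', -1), ('hello', 2), ('hi', 4)]
```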

In [13]:
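def find_answers(query: str, answers: dict) -> tuple:
    # Returns (queue of answers sorted by distance, sum of all distances).
    q = pqueue()
    
    # Reduce the query to a scalar: the norm of its vector.
    v = vectorize(clean_query(query, stoplist))
    v = np.round(np.linalg.norm(v), 3)
    
    psum = 0
    
    for ans, barycenter in answers.items():
        # The priority is the distance between the stored value and the query value.
        priority = np.round(abs(barycenter - v), 3)
        psum += priority
        
        q.add(ans, priority)
    
    return q, psum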
        
result = find_answers("Allume la télé", answers)

print(result[0])
print(result[1])

("Ok. J'allume la télévision.", 0.002) ("Ok. J'éteins la télévision.", 180.128) ('Voici la météo de demain : ', 192.408) 
372.538


##### Finding the confidence level of each answer
The confidence level indicates how likely a given answer is to be wrong.
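
The notebook does not define this indicator. One possible reading, given that `find_answers` also returns the sum of the distances, is to score each answer as one minus its share of that sum. The sketch below makes that assumption explicit. `confidence_levels` is a hypothetical helper, and the sample distances are the ones printed by the previous cell.

```python
def confidence_levels(ranked_answers):
    """Score each (answer, distance) pair as 1 - distance / sum of all distances.

    This is an assumed formula: a score close to 1 means the answer is much
    closer to the query than the other candidates.
    """
    total = sum(distance for _, distance in ranked_answers)
    if total == 0:
        # Every candidate matches exactly, so nothing separates them.
        return [(answer, 1.0) for answer, _ in ranked_answers]
    return [(answer, 1.0 - distance / total) for answer, distance in ranked_answers]


# Distances taken from the output of the previous cell.
ranked_answers = [
    ("Ok. J'allume la télévision.", 0.002),
    ("Ok. J'éteins la télévision.", 180.128),
    ("Voici la météo de demain : ", 192.408),
]
levels = confidence_levels(ranked_answers)
for answer, level in levels:
    print(f"{level:.3f}  {answer}")
```

The score is only meaningful with at least two candidates, and it is not a probability: the values do not sum to 1.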