# Word comparison functions

As features for the improved model, we considered comparing words at two levels: checking whether the two questions have the same word in the same position (comparing all tokens with the same index), and checking the keywords of each question (what, where, when, who, why...) to see whether both questions share the same first keyword. We implemented both ideas and obtained the following features, each with its own function:

First, we have same_words_ordered, which compares each token of the shorter tokenized question against the token at the same index in the other question. The number of matches is then divided by the length of the longer tokenized question, so that the part of the longer question that has no counterpart is not ignored.

In [1]:
def same_words_ordered(q1_tokens,q2_tokens):
    n = min(len(q1_tokens), len(q2_tokens))
    m = max(len(q1_tokens), len(q2_tokens))
    same = 0
    for i in range(n):
        if q1_tokens[i] == q2_tokens[i]:
            same += 1

    return same / m

def test_same_words_ordered():
    # Test case 1: Same question
    q1_tokens = ['what', 'is', 'the', 'capital', 'of', 'France']
    q2_tokens = ['what', 'is', 'the', 'capital', 'of', 'France']
    assert same_words_ordered(q1_tokens, q2_tokens) == 1.0

    # Test case 2: Different questions with no matching word at any position
    q1_tokens = ['what', 'do', 'you', 'like', 'about', 'France']
    q2_tokens = ['who', 'won', 'the', 'World', 'Cup', 'last', 'year']
    assert same_words_ordered(q1_tokens, q2_tokens) == 0.0

    # Test case 3: Questions with different lengths
    q1_tokens = ['what', 'is', 'the', 'capital', 'of', 'France']
    q2_tokens = ['what', 'is', 'the', 'capital', 'city', 'of', 'France']
    assert same_words_ordered(q1_tokens, q2_tokens) == 4.0 / 7.0

    # Test case 4: Different questions with some common words
    q1_tokens = ['what', 'is', 'the', 'capital', 'of', 'France']
    q2_tokens = ['what', 'is', 'the', 'capital', 'of', 'Spain']
    assert same_words_ordered(q1_tokens, q2_tokens) == 5.0 / 6.0


test_same_words_ordered()

In [2]:
def generate_ordered_words_feature(q1_tokens,q2_tokens):
    ow_feature = []
    for i in range(len(q1_tokens)):
        ow = same_words_ordered(q1_tokens[i],q2_tokens[i])
        ow_feature.append(ow)
    return ow_feature

q1_tokens = [['what', 'is', 'the', 'capital', 'of', 'France'], ['what', 'do', 'you', 'like', 'about', 'France'], ['what', 'is', 'the', 'capital', 'of', 'France']]
q2_tokens = [['what', 'is', 'the', 'capital', 'of', 'France'], ['who', 'won', 'the', 'World', 'Cup', 'last', 'year'], ['what', 'is', 'the', 'capital', 'city', 'of', 'France']]

generate_ordered_words_feature(q1_tokens, q2_tokens)

[1.0, 0.0, 0.5714285714285714]
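
The same computation can also be written with zip, which stops at the end of the shorter list just as range(min(...)) does in the loop above. This is only a sketch: the names same_words_ordered_zip and generate_ordered_words_feature_zip are hypothetical, and returning 0.0 for two empty questions is an assumption (the loop version would raise a ZeroDivisionError in that case).

```python
def same_words_ordered_zip(q1_tokens, q2_tokens):
    """Fraction of positions, relative to the longer question, that hold the same token."""
    longest = max(len(q1_tokens), len(q2_tokens))
    if longest == 0:
        # Assumption: two empty questions are scored 0.0 instead of dividing by zero
        return 0.0
    # zip pairs tokens by index and stops at the end of the shorter question
    same = sum(a == b for a, b in zip(q1_tokens, q2_tokens))
    return same / longest


def generate_ordered_words_feature_zip(q1_tokens, q2_tokens):
    # One score per question pair; zip silently truncates if the two lists differ in length
    return [same_words_ordered_zip(a, b) for a, b in zip(q1_tokens, q2_tokens)]
```

Note that pairing the two question lists with zip truncates to the shorter list, whereas the index-based loop would raise an IndexError if q2_tokens had fewer questions than q1_tokens.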

Next, we have the functions that check the keywords. They give a quick way to detect one thing two questions may have in common: the type of question being asked. We search for the first keyword in each question and then check whether both keywords are of the same type. Some keywords are grouped together because they can be interchanged without changing the meaning of the question.

In [3]:
def search_key_word(q_tokens):
    for token in q_tokens:
        if token in ['where', 'when', 'who', 'do', 'should']:
            return token
        elif token in ['can', 'how', 'could']:
            return 'can'
        elif token in ['what', 'which']:
            return 'what'
        # Hard-coded case
        elif token in ['why', 'whey']:
            return 'why'
        
    return None


def test_search_key_word():
    q_tokens = ['where', 'is', 'the', 'nearest', 'gas', 'station']
    assert search_key_word(q_tokens) == 'where'

    q_tokens = ['hey', 'should', 'I', 'wear', 'a', 'jacket']
    assert search_key_word(q_tokens) == 'should'

    q_tokens = ['who', 'is', 'the', 'president', 'of', 'the', 'USA']
    assert search_key_word(q_tokens) == 'who'

    q_tokens = ['why', 'is', 'the', 'sky', 'blue']
    assert search_key_word(q_tokens) == 'why'

    q_tokens = ['This', 'is', 'a', 'test', 'question']
    assert search_key_word(q_tokens) is None


test_search_key_word()
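
The keyword grouping can also be expressed as a lookup table, which keeps all groups in one place. The sketch below is meant to be equivalent to search_key_word; the names KEY_WORD_GROUPS and search_key_word_lookup are hypothetical. Matching is case-sensitive, as in the function above.

```python
# Hypothetical lookup table: each token maps to the label of its keyword group
KEY_WORD_GROUPS = {
    'where': 'where', 'when': 'when', 'who': 'who', 'do': 'do', 'should': 'should',
    'can': 'can', 'how': 'can', 'could': 'can',
    'what': 'what', 'which': 'what',
    'why': 'why', 'whey': 'why',  # 'whey' kept from the original hard-coded case
}


def search_key_word_lookup(q_tokens):
    # Return the group label of the first token that is a known keyword
    for token in q_tokens:
        if token in KEY_WORD_GROUPS:
            return KEY_WORD_GROUPS[token]
    return None
```

Adding or regrouping a keyword then only requires changing the table, not the control flow.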

In [4]:
def compare_key_word(q1_tokens, q2_tokens):
    key_word1 = search_key_word(q1_tokens)
    key_word2 = search_key_word(q2_tokens)
    if key_word1 and key_word1 == key_word2:
        return 1
    else:
        return 0
    

def test_compare_key_word():
    assert compare_key_word(['what', 'is', 'your', 'name'], ['what', 'is', 'my', 'name']) == 1
    assert compare_key_word(['where', 'are', 'you', 'from'], ['what', 'is', 'your', 'name']) == 0
    assert compare_key_word(['can', 'you', 'help', 'me'], ['how', 'can', 'I', 'help', 'you']) == 1
    assert compare_key_word(['why', 'is', 'the', 'sky', 'blue'], ['what', 'is', 'your', 'name']) == 0
    assert compare_key_word(['where', 'can', 'I', 'buy', 'tickets'], ['where', 'should', 'I', 'buy', 'tickets']) == 1
    assert compare_key_word(['what', 'is', 'the', 'capital', 'of', 'Spain'], ['which', 'city', 'is', 'the', 'capital', 'of', 'Spain']) == 1
    assert compare_key_word(['can', 'I', 'pay', 'with', 'cash'], ['can', 'I', 'pay', 'with', 'credit', 'card']) == 1
    assert compare_key_word(['why', 'do', 'birds', 'migrate'], ['how', 'do', 'birds', 'fly']) == 0


test_compare_key_word()


In [5]:
def generate_key_words_feature(q1_tokens, q2_tokens):
    kw_feature = []
    for i in range(len(q1_tokens)):
        kw = compare_key_word(q1_tokens[i],q2_tokens[i])
        kw_feature.append(kw)
    return kw_feature

def test_generate_key_words_feature():
    q1_tokens = [['what', 'is', 'your', 'name'], ['where', 'are', 'you', 'from'], ['can', 'you', 'help', 'me'], ['why', 'is', 'the', 'sky', 'blue']]
    q2_tokens = [['what', 'is', 'my', 'name'], ['what', 'is', 'my', 'name'], ['how', 'can', 'I', 'help', 'you'], ['what', 'is', 'my', 'name']]
    assert generate_key_words_feature(q1_tokens, q2_tokens) == [1, 0, 1, 0]


test_generate_key_words_feature()

# Discarded functions

We also had other features in mind that we ended up discarding because generating them for all the datasets took too long. One example is the 'negation' feature, which checks whether each question contains a negation and whether the two questions share this property.

It took one hour to generate the train features. We therefore discarded this feature, and with it the other feature ideas that needed the same spaCy model, such as grammatical structure, lemmatization and named entities. We also discarded synonyms for the same reason (too much computation time to get the features).

In [None]:
!python -m spacy download en_core_web_sm

In [6]:
import spacy

def negation(sent1, sent2, nlp):
    doc1 = nlp(sent1)
    doc2 = nlp(sent2)

    # Check if the negation is the same
    if (not any(token.dep_ == 'neg' for token in doc1)) == (not any(token.dep_ == 'neg' for token in doc2)):
        return 1
    else:
        return 0
    
def generate_negation_feature(q1, q2):
    nlp = spacy.load('en_core_web_sm')

    negation_feature = []
    for i in range(len(q1)):
        neg = negation(q1[i],q2[i],nlp)
        negation_feature.append(neg)
    return negation_feature

def test_negation():
    nlp = spacy.load('en_core_web_sm')
    
    # negation returns 1 when both questions agree on having (or not having) a negation
    assert negation("Did you eat lunch?", "You didn't eat lunch?", nlp) == 0
    assert negation("Did you eat lunch?", "You ate lunch?", nlp) == 1
    assert negation("Are you happy?", "You're not happy?", nlp) == 0
    assert negation("Are you happy?", "You're happy?", nlp) == 1
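
The negation feature above calls nlp once per question. spaCy also provides nlp.pipe, which processes texts in batches, and this might reduce the generation time. We did not measure it, so the following is only a sketch of that idea: the function names and the batch_size value are hypothetical, and nlp is assumed to be an already loaded spaCy pipeline with a dependency parser.

```python
def has_negation(doc):
    # A token with the 'neg' dependency label marks a negation in the parsed question
    return any(token.dep_ == 'neg' for token in doc)


def generate_negation_feature_batched(q1, q2, nlp, batch_size=256):
    # nlp.pipe yields one parsed document per input text, processed in batches
    docs1 = nlp.pipe(q1, batch_size=batch_size)
    docs2 = nlp.pipe(q2, batch_size=batch_size)
    # 1 when both questions agree on containing (or not containing) a negation, else 0
    return [int(has_negation(d1) == has_negation(d2)) for d1, d2 in zip(docs1, docs2)]
```

With spaCy installed, it could be called as generate_negation_feature_batched(q1, q2, spacy.load('en_core_web_sm')).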