In [1]:
import scipy
import numpy as np
import re
from sklearn import metrics  # needed by calculate_metrics

# Simple model functions

We built the simple model by following the preparation notebook `delivery_1_quora.ipynb`. Our `utils.py` therefore includes the functions from that notebook that are needed to create the count vectorizer model, including the casting preprocessing and the mistakes testing.

`cast_list_as_strings` is explained in the following section (Preprocessing functions), because we also use it with the improved model. 

# Preprocessing functions

All these preprocessing functions receive a list of strings (each one representing a sentence) and apply a transformation to each string/sentence.

Each call preprocesses a single question field. To cover both question1 and question2 (of the training, validation or test set), we have to call each function once per field.

In order to show their usage, we are going to consider two rows of our training set:

In [2]:
questions1 = ["How do you convert direct speech into reported speech and vice versa including all cases?", "Where can I buy used wine barrels?"]
questions2 = ["I feel weak at spoken English. I have sentences ready in my mind, but I can't speak it. What should I do?", "Where can you buy used wine barrels?"]

## cast_list_as_strings

In [3]:
def cast_list_as_strings(mylist):
    mylist_of_strings = []
    for x in mylist:
        mylist_of_strings.append(str(x))
    return mylist_of_strings

Casts each element of the list to a string. This avoids problems when some entries have another data type, such as floats.

In [4]:
questions1_casted = cast_list_as_strings(questions1)
questions2_casted = cast_list_as_strings(questions2)
print("Questions 1:",questions1_casted)
print("Questions 2:",questions2_casted)

Questions 1: ['How do you convert direct speech into reported speech and vice versa including all cases?', 'Where can I buy used wine barrels?']
Questions 2: ["I feel weak at spoken English. I have sentences ready in my mind, but I can't speak it. What should I do?", 'Where can you buy used wine barrels?']
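As a small illustration of why the cast matters, the sketch below applies an equivalent list-comprehension version to a list with mixed types. The input values are made up for this example, and the idea that a missing question shows up as a `nan` float is an assumption on our side.

```python
def cast_list_as_strings(mylist):
    # Same behaviour as the function above, written as a list comprehension
    return [str(x) for x in mylist]

# Hypothetical input: one real question, one float, one missing value
mixed = ["Where can I buy used wine barrels?", 3.5, float("nan")]
casted = cast_list_as_strings(mixed)
print(casted)
# ['Where can I buy used wine barrels?', '3.5', 'nan']
```

After the cast, string methods such as `lower()` can be applied to every element without raising an `AttributeError`.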


## lower_list

In [5]:
def lower_list(mylist):
    list_lowered = []
    for string in mylist:
        list_lowered.append(string.lower())
    return list_lowered

Lowercases each string/sentence of the list, so that upper and lower case letters are not treated as different.

In [6]:
questions1_lw = lower_list(questions1_casted)
questions2_lw = lower_list(questions2_casted)
print("Questions 1:",questions1_lw)
print("Questions 2:",questions2_lw)

Questions 1: ['how do you convert direct speech into reported speech and vice versa including all cases?', 'where can i buy used wine barrels?']
Questions 2: ["i feel weak at spoken english. i have sentences ready in my mind, but i can't speak it. what should i do?", 'where can you buy used wine barrels?']


## remove_sw

In [7]:
def remove_sw(mylist,stop_words):
    # Pattern to match any stop word as a whole word. It is compiled once,
    # and each stop word is escaped so that it is always matched literally
    pattern = re.compile(r'\b(' + '|'.join(map(re.escape, stop_words)) + r')\b')
    list_without_sw = []
    for string in mylist:
        # Remove the stop words using the regular expression pattern
        list_without_sw.append(pattern.sub('',string))
    return list_without_sw

In [8]:
stop_words = ["a", "an", "the", "and", "but", "or", "in", "on", "at", "to", "of", "for","i","you","he","she","it","we","they",
             "me","him","her","us","them","my","your","his","its","our","their","mine","yours","hers","ours","theirs","myself",
              "yourself","himself","herself","itself","ourselves","yourselves","themselves","this","that","these","those"]

Given a list of stop words, the regular expression pattern removes all of those words from each string/sentence of the list. These words carry little meaning and can make two equivalent questions look more different than they are. The removal leaves extra spaces behind, which the tokenization step ignores.

In [9]:
questions1_sw = remove_sw(questions1_lw,stop_words)
questions2_sw = remove_sw(questions2_lw,stop_words)
print("Questions 1:",questions1_sw)
print("Questions 2:",questions2_sw)

Questions 1: ['how do  convert direct speech into reported speech  vice versa including all cases?', 'where can  buy used wine barrels?']
Questions 2: [" feel weak  spoken english.  have sentences ready   mind,   can't speak . what should  do?", 'where can  buy used wine barrels?']
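The `\b` word boundaries in the pattern matter: without them, a stop word such as `a` would also be removed from inside other words. The following sketch compares both variants on a made-up sentence with a hypothetical three-word stop list (`demo_stop_words` is not part of `utils.py`).

```python
import re

# Hypothetical short stop list, only for this illustration
demo_stop_words = ["a", "at", "the"]
sentence = "a barrel at the attic"

# With word boundaries: only whole words are removed
bounded = re.compile(r"\b(" + "|".join(demo_stop_words) + r")\b")
# Without word boundaries: matches inside other words too
unbounded = re.compile("(" + "|".join(demo_stop_words) + ")")

with_boundaries = bounded.sub("", sentence).split()
without_boundaries = unbounded.sub("", sentence).split()
print(with_boundaries)     # ['barrel', 'attic']
print(without_boundaries)  # ['brrel', 't', 'ttic']
```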


## tokenize

In [10]:
def tokenize(mylist):
    list_tokenized = []
    for string in mylist:
        # Regular expression pattern to match words
        pattern = re.compile(r"\w+")
        tokens = pattern.findall(string)
        list_tokenized.append(tokens)
    return list_tokenized

Obtains the words (a list of strings) of each string/sentence of the list. The regular expression pattern `\w+` also treats numbers as words, and it splits on apostrophes, so "can't" becomes the tokens `can` and `t`. Although the similarity between nearby numbers cannot be captured, we think that keeping numbers makes sense in this question-comparison scenario, because two equivalent questions should ask about exactly the same number.
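To illustrate the point about numbers, here is a minimal sketch with a compact version of `tokenize` and two made-up questions that differ only in a number.

```python
import re

def tokenize(mylist):
    # Compact version of the function above: one list of tokens per sentence
    pattern = re.compile(r"\w+")
    return [pattern.findall(string) for string in mylist]

# Made-up questions that differ only in a number
tokens = tokenize(["how do i lose 5 kg?", "how do i lose 50 kg?"])
print(tokens[0])  # ['how', 'do', 'i', 'lose', '5', 'kg']
print(tokens[1])  # ['how', 'do', 'i', 'lose', '50', 'kg']
# '5' and '50' are different tokens, so the two questions do not match exactly
```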

In [11]:
questions1_tokens = tokenize(questions1_sw)
questions2_tokens = tokenize(questions2_sw)
print("Questions 1:",questions1_tokens)
print("Questions 2:",questions2_tokens)

Questions 1: [['how', 'do', 'convert', 'direct', 'speech', 'into', 'reported', 'speech', 'vice', 'versa', 'including', 'all', 'cases'], ['where', 'can', 'buy', 'used', 'wine', 'barrels']]
Questions 2: [['feel', 'weak', 'spoken', 'english', 'have', 'sentences', 'ready', 'mind', 'can', 't', 'speak', 'what', 'should', 'do'], ['where', 'can', 'buy', 'used', 'wine', 'barrels']]
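Since the four steps are always applied in the same order, they could be chained in a single helper. The sketch below is only one possible way to do it: the name `preprocess` is hypothetical (it is not in `utils.py`), and the stop list is shortened for the example.

```python
import re

def preprocess(mylist, stop_words):
    """Hypothetical helper: cast, lowercase, remove stop words, tokenize."""
    sw_pattern = re.compile(r"\b(" + "|".join(map(re.escape, stop_words)) + r")\b")
    word_pattern = re.compile(r"\w+")
    result = []
    for x in mylist:
        text = str(x).lower()                      # cast_list_as_strings + lower_list
        text = sw_pattern.sub("", text)            # remove_sw
        result.append(word_pattern.findall(text))  # tokenize
    return result

demo_stop_words = ["i", "you"]  # shortened stop list, for illustration only
q1 = preprocess(["Where can I buy used wine barrels?"], demo_stop_words)
q2 = preprocess(["Where can you buy used wine barrels?"], demo_stop_words)
print(q1)        # [['where', 'can', 'buy', 'used', 'wine', 'barrels']]
print(q1 == q2)  # True
```

It would still be called once per question field, as with the individual functions.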


# Jaccard distance functions

As a feature for the improved model, we use the Jaccard distance at sentence level. We implement it and compute the feature with the following functions:

## jaccard_similarity

In [12]:
def jaccard_similarity(sent1,sent2):
    return len(sent1.intersection(sent2)) / len(sent1.union(sent2))

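Calculates the Jaccard similarity between the tokens of two sentences: the number of elements of the intersection divided by the number of elements of the union. Both arguments must be Python sets. If both sets are empty, the union is empty and the division raises a `ZeroDivisionError`.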
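In set notation, with $A$ and $B$ denoting the token sets of the two sentences (symbols introduced here only for illustration), the definition reads:

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad 0 \le J(A, B) \le 1
```

For instance, for the made-up sets $A = \{\text{where}, \text{can}, \text{buy}\}$ and $B = \{\text{where}, \text{can}, \text{sell}\}$, the intersection has 2 elements and the union has 4, so $J(A, B) = 2/4 = 0.5$.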

## jaccard_distance

In [13]:
def jaccard_distance(sent1,sent2):
    return 1-jaccard_similarity(sent1,sent2)

Subtracts the Jaccard similarity from 1 in order to obtain the Jaccard distance.
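One edge case is worth keeping in mind: if both token sets are empty (for example, when a question contains only stop words or punctuation), the union is empty and the division fails. The sketch below shows one possible guard. It is not part of `utils.py`; the name `safe_jaccard_distance` and the default of returning 1.0 for two empty sets are assumptions made for this example.

```python
def safe_jaccard_distance(sent1, sent2, empty_value=1.0):
    # Hypothetical variant: sent1 and sent2 are sets of tokens
    union = sent1 | sent2
    if not union:
        # Both sets are empty: the ratio is undefined, so return a chosen value
        return empty_value
    return 1 - len(sent1 & sent2) / len(union)

print(safe_jaccard_distance({"buy", "wine"}, {"buy", "wine"}))  # 0.0
print(safe_jaccard_distance({"buy"}, {"sell"}))                 # 1.0
print(safe_jaccard_distance(set(), set()))                      # 1.0
```

Whether two empty questions should count as identical (distance 0) or as unrelated (distance 1) is a modelling choice, which is why the value is left as a parameter.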

## generate_jd_feature

In [14]:
def generate_jd_feature(q1_tokens,q2_tokens):
    jd_feature = []
    for i in range(len(q1_tokens)):
        jd = jaccard_distance(set(q1_tokens[i]),set(q2_tokens[i]))
        jd_feature.append(jd)
    return jd_feature

Given the lists of tokenized sentences for question 1 and question 2, this function calculates the Jaccard distance between each pair of sentences. The result is the Jaccard distance feature that we use in our improved model. To show its usage, we apply it to the previously tokenized sentences:

In [15]:
print("Questions 1:",questions1_tokens)
print("Questions 2:",questions2_tokens)

Questions 1: [['how', 'do', 'convert', 'direct', 'speech', 'into', 'reported', 'speech', 'vice', 'versa', 'including', 'all', 'cases'], ['where', 'can', 'buy', 'used', 'wine', 'barrels']]
Questions 2: [['feel', 'weak', 'spoken', 'english', 'have', 'sentences', 'ready', 'mind', 'can', 't', 'speak', 'what', 'should', 'do'], ['where', 'can', 'buy', 'used', 'wine', 'barrels']]


In [16]:
generate_jd_feature(questions1_tokens,questions2_tokens)

[0.96, 0.0]

After preprocessing and tokenization, the two questions of the second pair have become identical, so their Jaccard distance is 0. In the first pair, there is almost no overlap (only the token `do` is shared), so the distance is high. Indeed, the target feature of the second pair indicates that they are duplicates, while the target of the first pair indicates the opposite.
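The value 0.96 for the first pair can be checked by hand from the token lists printed above. A short sketch of that check:

```python
q1 = ['how', 'do', 'convert', 'direct', 'speech', 'into', 'reported', 'speech',
      'vice', 'versa', 'including', 'all', 'cases']
q2 = ['feel', 'weak', 'spoken', 'english', 'have', 'sentences', 'ready', 'mind',
      'can', 't', 'speak', 'what', 'should', 'do']

# Converting to sets drops the repeated 'speech' in q1
shared = set(q1) & set(q2)
union = set(q1) | set(q2)
distance = 1 - len(shared) / len(union)

print(shared)      # {'do'}
print(len(union))  # 25
print(distance)    # 0.96
```

There are 12 distinct tokens in the first question and 14 in the second, with one in common, so the union has 12 + 14 - 1 = 25 elements and the distance is 1 - 1/25.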

# Evaluation functions

## calculate_metrics

In [17]:
def calculate_metrics(y,X,model):
    # Predict once and reuse the predictions for every metric
    y_pred = model.predict(X)
    model_metrics = []
    # Note: the ROC AUC is computed from the output of model.predict
    model_metrics.append(metrics.roc_auc_score(y,y_pred))
    model_metrics.append(metrics.accuracy_score(y,y_pred))
    model_metrics.append(metrics.precision_score(y,y_pred))
    model_metrics.append(metrics.recall_score(y,y_pred))
    model_metrics.append(metrics.f1_score(y,y_pred))
    return model_metrics

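Given a trained model and the features obtained from one of the datasets (train, validation or test), it calculates the ROC AUC, accuracy, precision, recall and F1 score, and returns them in that order as a list. The name `metrics` refers to `sklearn.metrics`, which must be imported.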
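To make the meaning of these scores concrete, the sketch below recomputes accuracy, precision, recall and F1 from the confusion counts in plain Python. It is only an illustration: the helper name `confusion_based_metrics` is hypothetical, the labels are made up, the encoding 1 = duplicate is an assumption, and in the notebook the scores come from `sklearn.metrics`.

```python
def confusion_based_metrics(y_true, y_pred):
    # Hypothetical helper for binary labels (assumed: 1 = duplicate, 0 = not)
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    # No zero-division guard: assumes at least one true positive
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Made-up labels and predictions
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 1, 1]
accuracy, precision, recall, f1 = confusion_based_metrics(y_true, y_pred)
print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
# 0.625 0.6 0.75 0.667
```

Here 3 duplicates are found, 1 is missed and 2 non-duplicates are flagged by mistake, which gives a precision of 3/5 and a recall of 3/4.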