# Utils - Àlex Pujol

This notebook contains the functions, together with their explanations, written by Àlex Pujol for the Quora Questions task of the NLP course.

In [None]:
import spacy
import random
from utils import words_count, tokenize_text, remove_accents, remove_punctuation

import nltk
#nltk.download('punkt')

In [None]:
# Example questions to test the functions
questions = [
    "I like to read books", "Reading books is enjoyable for me",
    "She runs every morning", "Every morning she goes for a run",
    "The cat is sleeping", "The sleeping cat is cute",
    "I am learning to code", "Coding is a useful skill to learn",
    "He enjoys playing video games", "Playing video games is his favorite hobby",
    "The car stopped abruptly", "The abrupt stop of the car was surprising",
    "We went to the beach", "The beach was crowded and sunny",
    "She sings beautifully", "Her beautiful singing voice is captivating",
    "The restaurant serves delicious food", "The food at the restaurant is always tasty",
    "He is studying for an exam", "Studying is important for academic success",
    "The flowers are blooming", "The blooming flowers are a sign of spring",
    "The movie was entertaining", "I found the movie to be quite enjoyable",
    "She is a talented musician", "Music is her passion and she is very talented",
    "The building is very tall", "The tall building is an impressive feat of engineering",
    "He traveled to Europe last summer", "Last summer he went on a trip to Europe",
    "I love spending time with my family", "My family is very important to me",
    "The book was very suspenseful", "I found the book to be quite thrilling",
    "She enjoys painting and drawing", "Art is her favorite form of self-expression",
    "The sun is shining brightly today", "The bright sun is making everything look beautiful",
    "He is an excellent chef", "Cooking is his passion and he is very skilled"
]

## Feature: Count Syllables

Below are some functions to count the syllables of a word and of a whole sentence. The counts are useful as a feature on their own and as a building block for more complex features.

In [None]:
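def count_word_syllables(word):
    ''' 
    Estimates the number of syllables of a word by counting its vowel groups.

    Args: 
        word (str): a tokenized word from a sentence
        
    Return:
        int: estimated number of syllables in the word
    '''
    count = 0
    vowels = 'aeiouy'
    word = word.lower().strip(".:;?!")
    # Guard against empty input (e.g. a token made only of punctuation)
    if not word:
        return 0
    # A leading vowel starts the first vowel group
    if word[0] in vowels:
        count += 1
    # Each vowel preceded by a non-vowel starts a new vowel group
    for index in range(1, len(word)):
        if word[index] in vowels and word[index-1] not in vowels:
            count += 1
    # A final 'e' is usually silent
    if word.endswith('e'):
        count -= 1
    # ...unless it belongs to a final 'le', as in "table"
    if word.endswith('le'):
        count += 1
    # Every non-empty word has at least one syllable
    if count == 0:
        count += 1
    return int(count)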

def count_sentence_syllables(doc):
    '''
    Args: 
        doc (str): a raw sentence
        
    Return:
        int: number of syllables of the entire sentence
    '''
    count = 0
    for w in tokenize_text(remove_accents(remove_punctuation(doc))):
        count += count_word_syllables(w)
    return int(count)

In [None]:
# Examples
ex = random.choice(questions)
print("Example sentence: \n=>",ex)
print()
print("Syllables for each word: \n=>", [count_word_syllables(w) for w in tokenize_text(remove_accents(remove_punctuation(ex)))])
print()
print("Total amount of syllables in the sentence: \n=>", count_sentence_syllables(ex))
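
The heuristic above can also be read as "count the vowel groups, then adjust for a final `e`". The following is a minimal, self-contained sketch of that reading, not the notebook's own implementation: the names `count_vowel_groups` and `heuristic_syllables` are hypothetical, and punctuation stripping is left out for brevity. It is mainly meant to show where the rule works and where it breaks down.

```python
import re

def count_vowel_groups(word):
    """Count maximal runs of vowels (a, e, i, o, u, y) in a word."""
    return len(re.findall(r"[aeiouy]+", word.lower()))

def heuristic_syllables(word):
    """Vowel-group estimate with the silent-'e' and final-'le' adjustments."""
    word = word.lower()
    count = count_vowel_groups(word)
    if word.endswith("e"):
        count -= 1  # a final 'e' is usually silent ("are", "the")
    if word.endswith("le"):
        count += 1  # a final 'le' is usually voiced ("table")
    return max(count, 1)  # assume at least one syllable per word

for w in ["books", "reading", "table", "the", "enjoyable"]:
    print(w, heuristic_syllables(w))
```

With this sketch, "books", "reading", "table" and "the" get 1, 2, 2 and 1 syllables respectively, while "enjoyable" gets 3 although it is usually pronounced with four. The estimate is therefore best treated as an approximate feature rather than an exact count.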

## Feature: Readability metrics

### Flesch–Kincaid readability tests

The **Flesch–Kincaid readability tests** are readability tests designed to indicate how difficult a passage in English is to understand. There are two tests: the Flesch Reading-Ease, and the Flesch–Kincaid Grade Level. Although they use the same core measures (word length and sentence length), they have different weighting factors. 
- Flesch Reading-Ease: Higher scores indicate material that is easier to read; lower numbers mark passages that are more difficult to read.
- Flesch–Kincaid grade level: Presents a score as a U.S. grade level, making it easier for teachers, parents, librarians, and others to judge the readability level of various books and texts. It can also mean the number of years of education generally required to understand this text.
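
As a worked illustration of the two formulas, here is a minimal sketch that takes the word, sentence and syllable counts directly instead of computing them from text. The function names `flesch_reading_ease` and `flesch_kincaid_grade` are hypothetical and the counts in the example are supplied by hand; the constants are the standard published ones.

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Flesch Reading-Ease: higher means easier to read."""
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)

def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid Grade Level: approximate U.S. school grade."""
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59

# "The cat is sleeping": 4 words, 1 sentence, 5 syllables (the, cat, is, sleep-ing)
print(round(flesch_reading_ease(4, 1, 5), 3))
print(round(flesch_kincaid_grade(4, 1, 5), 3))
```

For this short sentence the Reading-Ease score is about 97 (very easy) and the grade level is about 0.7. Both scores were designed for longer passages, so values for single short questions can fall outside the usual ranges.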

In [None]:
# Flesch Reading-Ease
def Flesch_Reading_Ease(doc, a = 206.835, b = 1.015, c = 84.6):
    '''
    Args:
        doc (str): sentence to analyze (treated as a single sentence)
        a (float): Flesch Reading-Ease constant term
        b (float): weight of the average sentence length (words per sentence)
        c (float): weight of the average word length (syllables per word)
    
    Return:
        float: the Flesch Reading-Ease score of the sentence
    '''
    # The division by 1 reflects that each input is assumed to be exactly one sentence
    return a - b * (words_count(doc) / 1) - c * (count_sentence_syllables(doc) / words_count(doc))


# Flesch-Kincaid Grade Level
def Flesch_Grade_Level(doc, a = 0.39, b = 11.8, c = 15.59):
    '''
    Args:
        doc (str): sentence to analyze (treated as a single sentence)
        a (float): weight of the average sentence length (words per sentence)
        b (float): weight of the average word length (syllables per word)
        c (float): Flesch-Kincaid Grade Level constant term
    
    Return:
        float: the Flesch-Kincaid Grade Level score of the sentence
    '''
    # The division by 1 reflects that each input is assumed to be exactly one sentence
    return a * (words_count(doc) / 1) + b * (count_sentence_syllables(doc) / words_count(doc)) - c
    
    

In [None]:
# Examples
ex = random.choice(questions)
print("Example sentence: \n=>",ex)
print()
print("Flesch Reading-Ease score: \n=>", Flesch_Reading_Ease(ex))
print()
print("Flesch-Kincaid Grade Level score: \n=>", Flesch_Grade_Level(ex))
print()

## Feature: Linguistic Features
We use the spaCy library to retrieve different linguistic annotations for each question.

In [None]:
class Linguistics():
    '''
    Uses the spaCy library to extract linguistic features from sentences, either to use them directly as features or to build more complex features.
    '''
    def __init__(self, doc):
        '''
        Args:
            doc (str): sentence to analyze
        '''
        # Note: the model is loaded on every instantiation, which is slow when processing many sentences
        nlp = spacy.load("en_core_web_sm")
        self.doc = doc
        self.tokens = nlp(doc)
    
    def text(self):
        '''
        Tokenizes the sentence
        '''
        return [token.text for token in self.tokens]
    
    def lemma(self):
        '''
        Lemmatizes the sentence
        '''
        return [token.lemma_ for token in self.tokens]
    
    def pos(self):
        '''
        Applies simple Part-Of-Speech tagging to the sentence
        '''
        return [token.pos_ for token in self.tokens]
    
    def tag(self):
        '''
        Applies detailed Part-Of-Speech tagging to the sentence
        '''
        return [token.tag_ for token in self.tokens]
    
    def dep(self):
        '''
        Returns the syntactic dependency label of each token in the sentence
        '''
        return [token.dep_ for token in self.tokens]
    
    def shape(self):
        '''
        Applies tagging to words according to their shape
        '''
        return [token.shape_ for token in self.tokens]
    
    def is_alpha(self):
        '''
        Flags, for each token, whether it consists of alphabetic characters only
        '''
        return [token.is_alpha for token in self.tokens]
    
    def is_stop(self):
        '''
        Flags, for each token, whether it is a stopword
        '''
        return [token.is_stop for token in self.tokens]

## Functionality: TF-IDF implementation in Cython
The method `tf_idf.compute_tf_idf(str docs)` computes the TF-IDF of `docs` and returns each document as a normalized sparse vector, where each element of the vector is a pair `(str word, double value)`.
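
To make the output format concrete, below is a minimal pure-Python sketch of a TF-IDF computation that returns the same kind of structure: one list of `(word, value)` pairs per document, scaled to unit L2 norm. It is not the Cython implementation. The function name `compute_tf_idf_sketch`, the whitespace tokenization, the relative-frequency TF, the `log(N / df)` IDF and the L2 normalization are all assumptions made for illustration, and the actual module may use different variants.

```python
import math
from collections import Counter

def compute_tf_idf_sketch(docs):
    """Return, per document, a list of (word, weight) pairs with unit L2 norm."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Assumed variants: tf = relative frequency, idf = log(N / df)
        weights = {w: (c / len(tokens)) * math.log(n_docs / df[w])
                   for w, c in counts.items()}
        norm = math.sqrt(sum(v * v for v in weights.values()))
        if norm == 0:
            vectors.append([])  # empty document or only words shared by all documents
            continue
        # Sparse representation: keep only words with a non-zero weight
        vectors.append([(w, v / norm) for w, v in weights.items() if v > 0])
    return vectors

docs = ["the cat is sleeping", "the sleeping cat is cute", "we went to the beach"]
for vec in compute_tf_idf_sketch(docs):
    print(vec)
```

In this sketch a word that occurs in every document (here "the") gets an IDF of zero and is dropped from the sparse vectors, which is one reason some implementations prefer a smoothed IDF.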

In [None]:
from cython_utils import tf_idf

In [None]:
# Examples

import time
start_time = time.time()
tf_idfs = tf_idf.compute_tf_idf(questions)
print("---Execution time: %s seconds ---" % (time.time() - start_time))

for i, doc_tf_idfs in enumerate(tf_idfs[:2]):
    print(f"Document {i}:")
    # Use a name other than `tf_idf` so the imported module is not overwritten
    for term, value in doc_tf_idfs:
        print(f"\t{term}: {value}")