# Experiments on definition and definiendum extraction

## Datasets available for experimentation

The datasets used in these experiments are available in the following public GitHub repository:

*   https://github.com/sebastianvolti/pln-me-datasets



## Manual analysis of definitions and crossword clues

As a first step, we manually analyzed candidate definitions and clues that could be used to build a crossword.

For this task we relied on simple texts taken from two different data sources:

*   Texts taken from https://lingua.com/english/reading/.
*   Texts taken from the **“readworks”** dataset.

We collected 20 manual examples for each data source, detailed in the following document:

*   https://drive.google.com/file/d/1jplE-uB8wkzbQ-uTHtRtZ2CaCBCBgZ0D/view?usp=sharing


In addition to the two text sources mentioned above, we collected another dataset of 96 basic-level English texts, taken from the following website:

*  https://www.eslfast.com/kidsenglish/




### Manual analysis of some texts

#### Possible tags (part of speech):

* **CC** coordinating conjunction and, but
* **CD** cardinal number
* **DT** determiner
* **EX** existential there (like: “there is” … think of it like “there exists”)
* **FW** foreign word
* **IN** preposition/subordinating conjunction
* **JJ** adjective ‘big’
* **JJR** adjective, comparative ‘bigger’
* **JJS** adjective, superlative ‘biggest’
* **LS** list marker 1)
* **MD** modal could, will
* **NN** noun, singular ‘desk’
* **NNS** noun plural ‘desks’
* **NNP** proper noun, singular ‘Harrison’
* **NNPS** proper noun, plural ‘Americans’
* **PDT** predeterminer ‘all the kids’
* **POS** possessive ending parent’s
* **PRP** personal pronoun I, he, she
* **PRP$** possessive pronoun my, his, hers
* **RB** adverb very, silently,
* **RBR** adverb, comparative better
* **RBS** adverb, superlative best
* **RP** particle give up
* **TO** to, as in go ‘to’ the store
* **UH** interjection, errrrrrrrm
* **VB** verb, base form take
* **VBD** verb, past tense took
* **VBG** verb, gerund/present participle taking
* **VBN** verb, past participle taken
* **VBP** verb, sing. present, non-3rd person take
* **VBZ** verb, 3rd person sing. present takes
* **WDT** wh-determiner which
* **WP** wh-pronoun who, what
* **WP$** possessive wh-pronoun whose
* **WRB** wh-adverb where, when

#### Analyzed sentences, **POS tagging**:

* ('People', 'NNS'), ('use', 'VBP'), **`('money', 'NN')`**, ('to', 'TO'), ('buy', 'VB'), ('things', 'NNS'), ('.', '.').
* ('A', 'DT'), **`('bank', 'NN')`**, ('is', 'VBZ'), ('a', 'DT'), ('place', 'NN'), ('that', 'WDT'), ('keeps', 'VBZ'), ('money', 'NN'), ('safe', 'JJ'), ('.', '.')
* ('Everything', 'VBG'), ('about', 'IN'), ('an', 'DT'), **`('elephant', 'NN')`**, ('is', 'VBZ'), ('big', 'JJ'), ('.', '.'), ('It', 'PRP'), ('has', 'VBZ'), ('big', 'JJ'), ('ears', 'NNS'), ('.', '.')
* ('An', 'DT'), ('elephant', 'NN'), ('also', 'RB'), ('has', 'VBZ'), ('a', 'DT'), ('long', 'JJ'), **`('trunk', 'NN')`**, ('.', '.'), ('It', 'PRP'), ('uses', 'VBZ'), ('its', 'PRP$'), ('trunk', 'NN'), ('to', 'TO'), ('breathe', 'VB'), ('and', 'CC'), ('to', 'TO'), ('smell', 'VB'), ('.', '.')
*  **`('Lightning', 'VBG')`**, ('is', 'VBZ'), ('electricity', 'NN'), ('.', '.'), ('It', 'PRP'), ('forms', 'VBZ'), ('in', 'IN'), ('clouds', 'NN'), ('during', 'IN'), ('a', 'DT'), ('storm', 'NN'), ('.', '.')
* ('A', 'DT'), **`('desert', 'NN')`**, ('is', 'VBZ'), ('a', 'DT'), ('dry', 'JJ'), ('place', 'NN'), ('.', '.'), ('Very', 'RB'), ('little', 'JJ'), ('rain', 'NN'), ('falls', 'NNS'), ('in', 'IN'), ('the', 'DT'), ('desert', 'NN'), ('.', '.')
* ('Deserts', 'NNS'), ('are', 'VBP'), ('dry', 'JJ'), (',', ','), ('but', 'CC'), ('plants', 'NNS'), ('and', 'CC'), **`('animals', 'NNS')`**, ('find', 'VBP'), ('ways', 'NNS'), ('to', 'TO'), ('live', 'VB'), ('there', 'RB'), ('.', '.')
* ('Deserts', 'NNS'), ('are', 'VBP'), ('dry', 'JJ'), (',', ','), ('but', 'CC'), **`('plants', 'NNS')`**, ('and', 'CC'), ('animals', 'NNS'), ('find', 'VBP'), ('ways', 'NNS'), ('to', 'TO'), ('live', 'VB'), ('there', 'RB'), ('.', '.')
* **`('Cactus', 'NN')`**, ('plants', 'NNS'), ('grow', 'VB'), ('in', 'IN'), ('deserts', 'NNS'), ('.', '.')
* **`('Camels', 'NNP')`**, ('live', 'VBP'), ('in', 'IN'), ('deserts', 'NNS'), ('.', '.')
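
Several of the tagged sentences above follow a copula shape, `[DT] NN is ...`, for example "A bank is a place that keeps money safe." The following is a minimal sketch, assuming the tags are already available as `(token, tag)` tuples, of how such a sentence could be split into a definiendum and a definition. The function name `split_copula_definition` is hypothetical and is not part of the notebook.

```python
# Tagged sentence copied from the analysis above.
tagged = [('A', 'DT'), ('bank', 'NN'), ('is', 'VBZ'), ('a', 'DT'),
          ('place', 'NN'), ('that', 'WDT'), ('keeps', 'VBZ'),
          ('money', 'NN'), ('safe', 'JJ'), ('.', '.')]

NOUN_TAGS = ('NN', 'NNS', 'NNP', 'NNPS')

def split_copula_definition(tagged_sentence):
    """Return (definiendum, definition) for sentences shaped like
    '[DT] NN is ...', or None when the shape does not match."""
    tokens = [tok for tok, _ in tagged_sentence]
    tags = [tag for _, tag in tagged_sentence]
    # Skip an optional leading determiner.
    start = 1 if tags and tags[0] == 'DT' else 0
    if len(tokens) < start + 3:
        return None
    # The subject must be tagged as a noun and be followed by 'is'.
    if tags[start] not in NOUN_TAGS:
        return None
    if tokens[start + 1].lower() != 'is':
        return None
    # The definition is everything after 'is', up to the first period.
    rest = tokens[start + 2:]
    if '.' in rest:
        rest = rest[:rest.index('.')]
    return tokens[start], ' '.join(rest)

print(split_copula_definition(tagged))
```

Because the sketch trusts the tagger, a tagging error such as `('Lightning', 'VBG')` makes it return `None` for "Lightning is electricity.", which is one reason the manual analysis of tagger output matters.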

#### Analyzed sentences, **spaCy named entities**:

* People use money to buy things.
  * **spaCy**: Nothing.
* A bank is a place that keeps money safe.
  * **spaCy**: Nothing.
* Everything about an elephant is big. It has big ears.
  * **spaCy**: Nothing.
* An elephant also has a long trunk. It uses its trunk to breathe and to smell.
  * **spaCy**: Nothing.
* Lightning is electricity. It forms in clouds during a storm.
  * **spaCy**: ('Lightning', 'PERSON').
* A desert is a dry place. Very little rain falls in the desert.
  * **spaCy**: Nothing.
* Deserts are dry, but plants and animals find ways to live there.
  * **spaCy**: Nothing.
* Cactus plants grow in deserts.
  * **spaCy**: Nothing.
* Camels live in deserts.
  * **spaCy**: Nothing.

## Implementation

Next we run some experiments on the texts provided by the **“readworks”** dataset, which consists of 368 basic-level English texts.

### Imports

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
import os
import re

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


### Functions

In [None]:
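def read_dataset_files(dataset_path):
    """Read every regular file in dataset_path and return (texts, filenames)."""
    all_texts = []
    filenames = []
    for filename in os.listdir(dataset_path):
        file_path = os.path.join(dataset_path, filename)
        # Record the filename only for entries that are actually read,
        # so both returned lists stay aligned.
        if os.path.isfile(file_path):
            filenames.append(filename)
            with open(file_path, 'r', encoding='utf-8') as file:
                all_texts.append(file.read())
    return all_texts, filenames

def extract(all_texts):
    # Join with a space so the last word of one text does not merge
    # with the first word of the next one.
    all_text = " ".join(all_texts)
    all_tokens = nltk.word_tokenize(all_text)
    return all_texts, all_tokens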

In [None]:
import nltk.data
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

def read_dataset_files_sentences(dataset_path):
  """Read every regular file and return (sentences per text, filenames)."""
  all_texts_sentences = []
  filenames = []
  for filename in os.listdir(dataset_path):
      file_path = os.path.join(dataset_path, filename)
      # Record the filename only for entries that are actually read,
      # so both returned lists stay aligned.
      if os.path.isfile(file_path):
          filenames.append(filename)
          with open(file_path, 'r', encoding='utf-8') as file:
              text = file.read()
              sentences = tokenizer.tokenize(text)
              all_texts_sentences.append(sentences)
  return all_texts_sentences, filenames

In [None]:
import numpy as np

def get_pos_features(texts):
  texts_pos = [[p for p in nltk.pos_tag(nltk.word_tokenize(text))] for text in texts]
  return texts_pos

def get_pos_features_sentences(texts):
  texts_pos = []
  for text in texts:
    texts_pos_sentence = [[p for p in nltk.pos_tag(nltk.word_tokenize(sentence))] for sentence in text]
    texts_pos.append(texts_pos_sentence)
  return texts_pos


In [None]:
def separate_tuples(sent):
  lists = list(map(list, zip(*sent)))
  tokens = lists[0]
  pos = lists[1]
  return tokens, pos
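
As a hedged illustration of what `separate_tuples` produces, the sketch below unzips one tagged sentence from the manual analysis into parallel token and tag lists. The name `split_tagged` is hypothetical; it differs from the function above only in that it returns two empty lists for an empty sentence instead of raising an `IndexError`.

```python
def split_tagged(sent):
    """Split [(token, tag), ...] into (tokens, tags); tolerate empty input."""
    if not sent:
        return [], []
    tokens, pos = zip(*sent)
    return list(tokens), list(pos)

sent = [('Camels', 'NNP'), ('live', 'VBP'), ('in', 'IN'),
        ('deserts', 'NNS'), ('.', '.')]
tokens, pos = split_tagged(sent)
print(tokens)
print(pos)
```

Keeping tokens and tags in two index-aligned lists is what lets the pattern functions look up a tag position and then read the word at the same index.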

In [None]:
def get_text_named_entities_nltk(texts_pos):
  texts_entities = []
  pattern = 'NP: {<DT>?<JJ>*<NN>}'
  cp = nltk.RegexpParser(pattern)
  for sentence in texts_pos:
    cs = cp.parse(sentence)
    texts_entities.append(cs)
  return texts_entities

In [None]:
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

def get_text_named_entities_spicy(all_texts):
  texts_entities = []
  for text in all_texts:
    doc = nlp(text)
    texts_entities.append([(X.text, X.label_) for X in doc.ents])
  return texts_entities

In [None]:
def parser_nltk(tags):
    chunkPattern = '\n'.join([
      'NP: {<JJ>*<NN>}',
      'NP: {<NNP>+}',
    ]) 
    chunkParser = nltk.RegexpParser(chunkPattern)
    chunkedData = chunkParser.parse(tags)
    # Return the chunk tree; without this the function always returned None.
    return chunkedData

In [None]:
def parser_spicy(tags):
    chunkPattern = '\n'.join([
    'NP: {<MONEY>+}',
    ]) 
    chunkParser = nltk.RegexpParser(chunkPattern)
    chunkedData = chunkParser.parse(tags)
    return chunkedData

### Patterns Functions

#### Utils

In [None]:
def generate_index_list(elem_list, elem_value):
  index = 0
  index_list = []
  for elem in elem_list:
    if (elem == elem_value):
      index_list.append(index)
    index+=1
  return index_list

In [None]:
def generate_index_list_v2(elem_list, elem_value_list):
  index = 0
  index_list = []
  for elem in elem_list:
    if (elem in elem_value_list):
      index_list.append(index)
    index+=1
  return index_list
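
The two index helpers are the building blocks of every pattern below: they turn a tag list into the positions where a given tag (or any tag from a set) occurs. The following sketch shows the idea on a sentence from the manual analysis. `indices_of` is a hypothetical, `enumerate`-based equivalent written only for this example.

```python
def indices_of(items, wanted):
    """Positions in `items` whose value belongs to `wanted`."""
    return [i for i, item in enumerate(items) if item in wanted]

tokens = ['A', 'desert', 'is', 'a', 'dry', 'place', '.']
pos = ['DT', 'NN', 'VBZ', 'DT', 'JJ', 'NN', '.']

noun_idx = indices_of(pos, {'NN', 'NNS', 'NNP', 'NNPS'})
verb_idx = indices_of(pos, {'VBZ'})

print(noun_idx, verb_idx)
print([tokens[i] for i in noun_idx])
```

The patterns then compare these positions (for example, noun index < verb index) to decide whether a sentence has the expected word order.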

In [None]:
# Matching is done token by token, so multi-word entries such as
# 'police officer' never match a tokenized sentence.
list_of_professions = ['accountant','actor','actress','air traffic controller','architect','artist','attorney','banker','bartender','barber','bookkeeper','builder','businessman','businesswoman','businessperson','butcher','carpenter','cashier','chef','coach','dental hygienist','dentist','designer','developer','dietician','doctor','economist','editor','electrician','engineer','farmer','filmmaker','fisherman','flight attendant','jeweler','judge','lawyer','mechanic','musician','nutritionist','nurse','notary','optician','painter','pharmacist','photographer','physician','pilot','plumber','police officer','politician','professor','programmer','psychologist','receptionist','salesman','salesperson','saleswoman','secretary','singer','surgeon','teacher','therapist','translator','undertaker','veterinarian','videographer','waiter','waitress','writer']

def professions(token_list):
  profs = []
  for prof in list_of_professions:
    if prof in token_list:
      profs.append(prof)
  return profs

In [None]:
def is_np(pos_list, nn):
  if (nn > 0 and pos_list[nn-1] in ['PRP', 'PRP$']):
    return True
  else:
    return False

In [None]:
def contains_nn(pos_list):
  for pos in pos_list:
    if pos in ['NN','NNS','NNP','NNPS']:
      return True
  return False

In [None]:
def contains_np(pos_list):
  index = 0
  for pos in pos_list:
    if pos in ['NN','NNS','NNP','NNPS']:
      if index > 0 and pos_list[index-1] in ['PRP', 'PRP$']:
        return True
    index+=1
  return False

In [None]:
def contains_verb(pos_list):
   for pos in pos_list:
    if pos in ['VB','VBD','VBG','VBN', 'VBP', 'VBZ']:
      return True
   return False

In [None]:
def contains_nnp(pos_list):
   for pos in pos_list:
    if pos in ['NNP', 'NNPS']:
      return True
   return False

In [None]:
def generate_np_index(pos_list, nn_list):
  np_list = []
  for nn in nn_list:
    if is_np(pos_list, nn):
      np_list.append(nn)
  return np_list 

#### Patterns


NP is a [list of professions] → NP is a **XXX**.

In [None]:
#PATTERN: PRP NN -> IS/VBZ -> [list_of_professions]
def possible_combinations_p1(token_list, pos_list, nn_index, vbz_index, prof_list):
  clue_list = []
  goal_list = []
  for nn in nn_index:
    for vbz in vbz_index:
      for prof in prof_list:
        if nn < vbz and vbz < token_list.index(prof):
          control_nn = [pos_list[elem] for elem in range(nn+1,vbz)]
          if (not contains_nn(control_nn)):
            clue = token_list[nn-1] + ' ' +  token_list[nn] + ' ' + token_list[vbz] + ' a ...'
            goal = prof 
            clue_list.append(clue)
            goal_list.append(goal)
    
  return clue_list, goal_list


def pattern_1(token_list, pos_list, tuple_list):
  clue_list = []
  goal_list = []
  prof_list = professions(token_list)
  if len(prof_list) > 0 and contains_nn(pos_list) and 'VBZ' in pos_list:
    #nn_index = generate_index_list(pos_list, 'NN')
    nn_index = generate_index_list_v2(pos_list, ['NN','NNS','NNP','NNPS'])
    np_index = generate_np_index(pos_list, nn_index)
    vbz_index = generate_index_list(pos_list, 'VBZ')
    return possible_combinations_p1(token_list, pos_list, np_index, vbz_index, prof_list) 
  return clue_list, goal_list
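
To make the first pattern concrete, here is a compact, self-contained sketch of the same idea applied to a simplified sentence, "My mother is a nurse." It is not the notebook's implementation: `profession_clues` and the reduced `PROFESSIONS` set are hypothetical, and the tags are written by hand on the assumption that the tagger would produce them.

```python
PROFESSIONS = {'nurse', 'teacher', 'doctor'}  # small illustrative subset

def profession_clues(tagged_sentence):
    """Sketch of 'NP is a [profession]': a noun preceded by a pronoun or
    possessive, then a VBZ verb, then a known profession."""
    tokens = [tok for tok, _ in tagged_sentence]
    tags = [tag for _, tag in tagged_sentence]
    results = []
    for i in range(1, len(tokens)):
        is_subject = tags[i].startswith('NN') and tags[i - 1] in ('PRP', 'PRP$')
        if not is_subject:
            continue
        for v in range(i + 1, len(tokens)):
            if tags[v] != 'VBZ':
                continue
            # No other noun may sit between the subject and the verb.
            if any(t.startswith('NN') for t in tags[i + 1:v]):
                continue
            for p in range(v + 1, len(tokens)):
                if tokens[p] in PROFESSIONS:
                    clue = f"{tokens[i - 1]} {tokens[i]} {tokens[v]} a ..."
                    results.append((clue, tokens[p]))
    return results

# Hand-tagged example sentence (assumed tags).
sentence = [('My', 'PRP$'), ('mother', 'NN'), ('is', 'VBZ'),
            ('a', 'DT'), ('nurse', 'NN'), ('.', '.')]
print(profession_clues(sentence))
```

The clue keeps the subject and the verb and hides the profession, which becomes the crossword answer.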


NP1 is a [list of professions], NP1 works at NP2 → **XXX** works at NP2, NP1 works at **XXX**.

In [None]:
#PATTERN: PRP NN1 -> IS/VBZ -> [list_of_professions], PRP NN1 -> IN -> NN2
def possible_combinations_p2(token_list, pos_list, nn_index, vbz_index, in_index, prof_list):
  clue_list = []
  goal_list = []
  for nn1 in nn_index:
    for nn2 in nn_index:
      if nn1 != nn2 and nn1 < nn2:
        for vbz in vbz_index:
          for in_i in in_index:
            for prof in prof_list:       
              if nn1 < vbz and is_np(pos_list, nn1) and vbz < token_list.index(prof) and vbz < in_i and token_list.index(prof) < nn2:
                  control_nn = [pos_list[elem] for elem in range(nn1+1,in_i)]
                  if (not contains_np(control_nn)):
                    clue = token_list[nn1-1] + ' ' + token_list[nn1] + ' works ' + token_list[in_i] + ' ...'
                    goal = token_list[nn2] 
                    clue_list.append(clue)
                    goal_list.append(goal)
                    clue = '... works ' + token_list[in_i] + ' ' + token_list[nn2]
                    goal = token_list[nn1-1] + ' ' + token_list[nn1]  
                    clue_list.append(clue)
                    goal_list.append(goal)
  return clue_list, goal_list

def pattern_2(token_list, pos_list, tuple_list):
  clue_list = []
  goal_list = []
  prof_list = professions(token_list)
  if len(prof_list) > 0 and contains_nn(pos_list) and 'IN' in pos_list and 'VBZ' in pos_list:
    #nn_index = generate_index_list(pos_list, 'NN')
    nn_index = generate_index_list_v2(pos_list, ['NN','NNS','NNP','NNPS'])
    vbz_index = generate_index_list(pos_list, 'VBZ')
    in_index = generate_index_list(pos_list, 'IN')
    if len(nn_index) > 1:
      return possible_combinations_p2(token_list, pos_list, nn_index, vbz_index, in_index, prof_list) 
  return clue_list, goal_list

NP is XXX, PRONOUN is YYY → NP is **…YYY…**

In [None]:
#PATTERN: PRP NN1 -> IS/VBZ1 -> NN2, PRP -> IS/VBZ2 -> NN3
def possible_combinations_p3(token_list, pos_list, nn_index, vbz_index, prp_index):
  clue_list = []
  goal_list = []
  for nn1 in nn_index:
    for nn2 in nn_index:
      for nn3 in nn_index:
        if nn1 != nn2 != nn3 and nn1 < nn2 < nn3:
          for vbz1 in vbz_index:
            for vbz2 in vbz_index:
              if vbz1 != vbz2 and vbz1 < vbz2:
                for prp in prp_index:
                  if nn1 < vbz1 and is_np(pos_list, nn1) and vbz1 < nn2 and nn2 < prp and prp < vbz2 and vbz2 < nn3:
                    clue = token_list[nn1-1] + ' ' + token_list[nn1] + ' ' + token_list[vbz1] + ' ...'
                    goal = token_list[nn3] 
                    clue_list.append(clue)
                    goal_list.append(goal)
  return clue_list, goal_list

def pattern_3(token_list, pos_list, tuple_list):
  clue_list = []
  goal_list = []
  if contains_nn(pos_list) and 'VBZ' in pos_list and 'PRP' in pos_list:
    #nn_index = generate_index_list(pos_list, 'NN')
    nn_index = generate_index_list_v2(pos_list, ['NN','NNS','NNP','NNPS'])
    vbz_index = generate_index_list(pos_list, 'VBZ')
    prp_index = generate_index_list(pos_list, 'PRP')
    if len(nn_index) > 2 and len(vbz_index) > 1:
      return possible_combinations_p3(token_list, pos_list, nn_index, vbz_index, prp_index) 
  return clue_list, goal_list

NP VERB XXX and YYY → NP VERB …XXX…, NP VERB …YYY…, not VERB in …XXX…, …YYY…

In [None]:
#PATTERN: PRP NN1/NNS1 VERB XXX CC YYY -> PRP NN1/NNS1 VERB ...YYY..., not VERB in …XXX…, …YYY…
def possible_combinations_p4(token_list, pos_list, np_index, verb_index, cc_index):
  clue_list = []
  goal_list = []
  for np in np_index:
    for verb in verb_index:
      for cc in cc_index:
        if ((np < verb < cc) and (cc - verb > 0)):
          xxx_pos = [pos_list[elem] for elem in range(verb+1,cc)]
          yyy_pos = [pos_list[elem] for elem in range(cc+1,len(token_list))]

          xxx = [token_list[elem] for elem in range(verb+1,cc)]
          yyy = [token_list[elem] for elem in range(cc+1,len(token_list)-1)]
       
          if (not contains_verb(xxx_pos) and not contains_verb(yyy_pos) and not contains_nnp(xxx_pos) and not contains_nnp(yyy_pos)):
            clue = token_list[np-1] + ' ' + token_list[np] + ' ' + token_list[verb] + ' ...'
            goal = ""
            for elem in yyy:
              goal = goal + elem + " "
            clue_list.append(clue)
            goal_list.append(goal)

           
            clue_inverse = ""
            for elem in yyy:
              clue_inverse = clue_inverse + elem + " "

            clue = token_list[np-1] + ' ' + token_list[np] + ' ' + token_list[verb] + ' ' + clue_inverse + token_list[cc] + ' ...'
            goal = ""
            for elem in xxx:
              goal = goal + elem + " "

            clue_list.append(clue)
            goal_list.append(goal)

      

  return clue_list, goal_list

def pattern_4(token_list, pos_list, tuple_list):
  clue_list = []
  goal_list = []
  if contains_nn(pos_list) and contains_verb(pos_list) and 'CC' in pos_list:
    nn_index = generate_index_list_v2(pos_list, ['NN','NNS','NNP','NNPS'])
    np_index = generate_np_index(pos_list, nn_index)
    verb_index = generate_index_list_v2(pos_list, ['VB','VBD','VBG','VBN', 'VBP', 'VBZ'])
    cc_index = generate_index_list(pos_list, 'CC')
    return possible_combinations_p4(token_list, pos_list, np_index, verb_index, cc_index) 
  return clue_list, goal_list
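
The coordination pattern can be illustrated with the sentence "My professors are smart and very friendly." The sketch below is a simplified, self-contained version of the idea in the heading, not a copy of `possible_combinations_p4`: the function name `coordination_clues` is hypothetical, the tags are hand-written assumptions, and the proper-noun exclusion of the original is omitted. It hides XXX in the first clue and YYY in the second.

```python
VERB_TAGS = {'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'}

def coordination_clues(tokens, tags):
    """Sketch of 'NP VERB XXX and YYY': one clue hides XXX, one hides YYY."""
    results = []
    for i in range(1, len(tokens)):
        # Subject: a noun preceded by a pronoun or possessive.
        if not (tags[i].startswith('NN') and tags[i - 1] in ('PRP', 'PRP$')):
            continue
        for v in range(i + 1, len(tokens)):
            if tags[v] not in VERB_TAGS:
                continue
            # Start at v + 2 so XXX contains at least one token.
            for c in range(v + 2, len(tokens) - 1):
                if tags[c] != 'CC':
                    continue
                xxx = tokens[v + 1:c]
                # Drop sentence-final punctuation from the right-hand side.
                yyy = [t for t, g in zip(tokens[c + 1:], tags[c + 1:]) if g != '.']
                # Neither side may contain another verb.
                if any(g in VERB_TAGS for g in tags[v + 1:c] + tags[c + 1:]):
                    continue
                if not yyy:
                    continue
                stem = f"{tokens[i - 1]} {tokens[i]} {tokens[v]}"
                results.append((f"{stem} ...", ' '.join(xxx)))
                results.append((f"{stem} {' '.join(xxx)} {tokens[c]} ...", ' '.join(yyy)))
    return results

tokens = ['My', 'professors', 'are', 'smart', 'and', 'very', 'friendly', '.']
tags = ['PRP$', 'NNS', 'VBP', 'JJ', 'CC', 'RB', 'JJ', '.']
print(coordination_clues(tokens, tags))
```

Rejecting sides that contain a verb keeps the pattern from firing on coordinated clauses such as "My father teaches mathematics, and my mother is a nurse", where "and" joins two sentences rather than two attributes.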

 NP1 is called NP2 →  NP2 is **…XXX…**, NP1 is **…XXX…**

In [None]:
#PATTERN: NN VBZ/VBP called NNP/NNPS
def possible_combinations_p5(token_list, pos_list, nn_index, np_index, v_index, called_index):
  clue_list = []
  goal_list = []
  for nn in nn_index:
    for np in np_index:
      for v in v_index:
        for c in called_index:
          if (nn < v < c < np):
              clue = token_list[nn-1] + ' ' + token_list[nn] + ' ' + token_list[v] + ' ...'
              goal = token_list[np-1] + ' ' + token_list[np]
              clue_list.append(clue)
              goal_list.append(goal)

              clue = token_list[np-1]  + ' ' + token_list[np] + ' ' + token_list[v] + ' ...'
              goal = token_list[nn-1] + ' ' + token_list[nn]
              clue_list.append(clue)
              goal_list.append(goal)
  return clue_list, goal_list

def pattern_5(token_list, pos_list, tuple_list):
  clue_list = []
  goal_list = []
  if contains_nn(pos_list) and contains_nnp(pos_list) and ('VBZ' in pos_list or 'VBP' in pos_list) and contains_verb(pos_list) and 'called' in token_list:
    nn_index = generate_index_list_v2(pos_list, ['NN','NNS','NNP','NNPS'])
    np_index = generate_index_list_v2(pos_list, ['NNP','NNPS'])
    v_index = generate_index_list_v2(pos_list, ['VBP','VBZ'])
    called_index = generate_index_list(token_list, 'called')
    return possible_combinations_p5(token_list, pos_list, nn_index, np_index, v_index, called_index) 
  return clue_list, goal_list

NP1 like/likes NP2


In [None]:
#NP1 like/likes NP2 -> What NP1 likes.
def possible_combinations_p6(token_list, pos_list, nn_index, like_index):
  clue_list = []
  goal_list = []
  for nn1 in nn_index:
    for nn2 in nn_index:
      for like in like_index:
        if (nn1 < like < nn2):
            if (is_np(pos_list, nn1)):
              nn1_token = token_list[nn1-1] + ' ' + token_list[nn1] + ' '
            else:
              nn1_token = token_list[nn1] + ' '
            clue = 'What ' + nn1_token +  'likes'
            goal = token_list[nn2]
            clue_list.append(clue)
            goal_list.append(goal)
  return clue_list, goal_list

def pattern_6(token_list, pos_list, tuple_list):
  clue_list = []
  goal_list = []
  if contains_nn(pos_list) and ('like' in token_list or 'likes' in token_list):
    nn_index = generate_index_list_v2(pos_list, ['NN','NNS','NNP','NNPS'])
    like_index = generate_index_list_v2(token_list, ['like','likes'])
    if (len(nn_index) > 1):
      return possible_combinations_p6(token_list, pos_list, nn_index, like_index) 
  return clue_list, goal_list

NP1 VERB NP2 xxx

In [None]:
# PRP NN1/NNS1 VERB PRP NN2/NNS2 xxx -> What PRP NN1/NNS1 VERB xxx.
def possible_combinations_p7(token_list, pos_list, np_index, verb_index):
  clue_list = []
  goal_list = []
  for np1 in np_index:
    for np2 in np_index:
      for verb in verb_index:
          if (np1 < verb < np2):
            xxx = [token_list[elem] for elem in range(np2+1,len(token_list)-1)]  
            final = " "   
            for elem in xxx:
              final = final + elem + " "     
            clue = 'What ' + token_list[np1-1] + ' ' + token_list[np1] + ' ' + token_list[verb] + final
            goal = token_list[np2-1] + ' ' + token_list[np2]
            clue_list.append(clue)
            goal_list.append(goal)
  return clue_list, goal_list

def pattern_7(token_list, pos_list, tuple_list):
  clue_list = []
  goal_list = []
  if contains_nn(pos_list) and contains_verb(pos_list) and 'CC' in pos_list:
    nn_index = generate_index_list_v2(pos_list, ['NN','NNS','NNP','NNPS'])
    np_index = generate_np_index(pos_list, nn_index)
    verb_index = generate_index_list_v2(pos_list, ['VB','VBD','VBG','VBN', 'VBP', 'VBZ'])
    if (len(np_index) > 1):
      return possible_combinations_p7(token_list, pos_list, np_index, verb_index) 
  return clue_list, goal_list

 I live/lives in a house XXX -> My house is XXX

In [None]:
#PATTERN: PRP live -> IN -> house -> XXX 
def possible_combinations_p8(token_list, pos_list, prp_index, in_index, liv_index, hs_index):
  clue_list = []
  goal_list = []
  for prp in prp_index:
    for liv in liv_index:
      for inn in in_index:
        for hs in hs_index:
          if (prp < liv < inn < hs):
            xxx = [token_list[elem] for elem in range(hs+1,len(token_list)-1)]  
            goal = ''
            clue = 'My house is ...'
            for elem in xxx:
              goal = goal + elem + " " 
            clue_list.append(clue)
            goal_list.append(goal)
  return clue_list, goal_list

def pattern_8(token_list, pos_list, tuple_list):
  clue_list = []
  goal_list = []
  if 'PRP' in pos_list and 'IN' in pos_list  and 'live' in token_list and 'house' in token_list:
    prp_index = generate_index_list(pos_list, 'PRP')
    in_index = generate_index_list(pos_list, 'IN')
    liv_index = generate_index_list(token_list, 'live')   
    hs_index = generate_index_list(token_list, 'house')   
    return possible_combinations_p8(token_list, pos_list, prp_index, in_index, liv_index, hs_index) 
  return clue_list, goal_list
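
As a worked example of this pattern, consider the first sentence of the sample text used later, "I live in a house near the mountains." The following sketch is a simplified stand-in for the functions above, under the assumption that the tagger yields the hand-written tags shown. `house_clue` is a hypothetical name, and it only looks at the first occurrence of "live" and "house".

```python
def house_clue(tokens, tags):
    """Sketch of 'I live in a house XXX' -> ('My house is ...', XXX)."""
    if 'live' not in tokens or 'house' not in tokens:
        return None
    live = tokens.index('live')
    house = tokens.index('house')
    if live >= house:
        return None
    has_subject = 'PRP' in tags[:live]          # pronoun before 'live'
    has_prep = 'IN' in tags[live + 1:house]     # preposition before 'house'
    if not (has_subject and has_prep):
        return None
    # XXX is whatever follows 'house', without the final punctuation.
    tail = [t for t, g in zip(tokens[house + 1:], tags[house + 1:]) if g != '.']
    if not tail:
        return None
    return 'My house is ...', ' '.join(tail)

tokens = ['I', 'live', 'in', 'a', 'house', 'near', 'the', 'mountains', '.']
tags = ['PRP', 'VBP', 'IN', 'DT', 'NN', 'IN', 'DT', 'NNS', '.']
print(house_clue(tokens, tags))
```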

 NP live/lives in a house XXX -> NP’s house is XXX

In [None]:
#PATTERN: PRP NN -> live/lives in -> house -> XXX 
def possible_combinations_p9(token_list, pos_list, np_index, in_index, liv_index, hs_index):
  clue_list = []
  goal_list = []
  for np in np_index:
    for liv in liv_index:
      for inn in in_index:
        for hs in hs_index:
          if (np < liv < inn < hs):
            xxx = [token_list[elem] for elem in range(hs+1,len(token_list)-1)]  
            goal = ''
            clue = token_list[np-1] + ' ' + token_list[np]  + ' house is ...'
            for elem in xxx:
              goal = goal + elem + " " 
            clue_list.append(clue)
            goal_list.append(goal)
  return clue_list, goal_list

def pattern_9(token_list, pos_list, tuple_list):
  clue_list = []
  goal_list = []
  if contains_nn(pos_list) and contains_nnp(pos_list) and 'IN' in pos_list and ('live' in token_list or 'lives' in token_list) and 'house' in token_list:
    nn_index = generate_index_list_v2(pos_list, ['NN','NNS','NNP','NNPS'])
    np_index = generate_index_list_v2(pos_list, ['NNP','NNPS'])
    in_index = generate_index_list(pos_list, 'IN')
    liv_index = generate_index_list_v2(token_list, ['live','lives'])
    hs_index = generate_index_list(token_list, 'house')   
    return possible_combinations_p9(token_list, pos_list, np_index, in_index, liv_index, hs_index) 
  return clue_list, goal_list

NP live/lives in XXX -> NP’s house is in XXX 

In [None]:
#PATTERN: PRP NN -> live/lives in XXX, house NOT in XXX -> PRP NN house is in...
def possible_combinations_p10(token_list, pos_list, np_index, in_index, liv_index):
  clue_list = []
  goal_list = []
  for np in np_index:
    for liv in liv_index:
      for inn in in_index:
        if (np < liv < inn):
          xxx = [token_list[elem] for elem in range(inn+1,len(token_list)-1)]  
          if 'house' not in xxx:
            goal = ''
            clue = token_list[np-1] + ' ' + token_list[np] + ' house is ...' 
            for elem in xxx:
              goal = goal + elem + " " 
            clue_list.append(clue)
            goal_list.append(goal)
  return clue_list, goal_list

def pattern_10(token_list, pos_list, tuple_list):
  clue_list = []
  goal_list = []
  if contains_nn(pos_list) and contains_nnp(pos_list) and 'IN' in pos_list and ('live' in token_list or 'lives' in token_list):
    nn_index = generate_index_list_v2(pos_list, ['NN','NNS','NNP','NNPS'])
    np_index = generate_index_list_v2(pos_list, ['NNP','NNPS'])
    in_index = generate_index_list(pos_list, 'IN')
    liv_index = generate_index_list_v2(token_list, ['live','lives'])
    return possible_combinations_p10(token_list, pos_list, np_index, in_index, liv_index) 
  return clue_list, goal_list

### Test Execution

First, we process each text as follows:  
* We obtain all the **tokens** (words) the texts contain.
* We apply **POS tagging** to each text, obtaining the appropriate **tag** for each **token**.
* We obtain the **named entities** of each text using **nltk** (in practice, noun-phrase chunks produced by a `RegexpParser` grammar).
* We obtain the **named entities** of each text using **spaCy**.

In [None]:
DATASET_PATH = "/content/drive/MyDrive/Semestre Impar 2022/Modulos PLN/ME/dataset-lingua/"
all_texts, filenames = read_dataset_files(DATASET_PATH)
all_texts, all_tokens = extract(all_texts)
texts_pos = get_pos_features(all_texts)
texts_named_entities_nltk = get_text_named_entities_nltk(texts_pos)
texts_named_entities_spicy = get_text_named_entities_spicy(all_texts)

## Pattern extraction

Based on the manual analysis of definitions and crossword clues, the goal is to build patterns suitable for automatically extracting definitions from texts.

We also build on the degree project by Esteche and Romero (2015).

### Candidate patterns

Below is a list of candidate patterns for extracting, from a given corpus, sentences that contain a definition/definiendum pair, which can then be used to build crossword clues.  
The patterns were derived from the previously analyzed examples from https://lingua.com/english/reading/




* NP VERB **…XXX…**
* NP is a [list of professions] → NP is a **XXX**. 
* NP1 is a [list of professions], NP1 works at NP2 → **XXX** works at NP2, NP1 works at **XXX**.
* NP1 VERB NP2 →  NP1 **…XXX…** NP2, **…XXX…** VERB NP2, NP1 VERB **…XXX…**
* NP is XXX, PRONOUN is YYY → NP is **…YYY…**
* NP VERB XXX and YYY → NP VERB **…XXX…**, NP VERB **…YYY…**, not VERB in …XXX…, …YYY… 
* NP1 is called NP2 →  NP2 is **…XXX…**, NP1 is **…XXX…**
* **XXX** there [are, is] YYY in NP → Place where there [are, is] NP.
* **XXX** NP1 VERB in NP2 →  Place where NP1 VERB.
* [NP,he,she] is NUMBER → [NP,he,she] is **XXX** years old.
* NP … . [he, she] is NUMBER → NP is **XXX** years old.
* NP … . [he, she] is …XXX… → NP is **XXX**.
* NP1 VERB1 …then… NP2 VERB2 → After NP1 VERB1 NP2 **XXX**, Before NP2 VERB2, NP1 **XXX**.
* NP1 [like, likes] NP2 → What NP1 likes. 
* NP1 VERB NP2 … → What NP1 VERB **XXX**.
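
Not every pattern in this list is implemented in the notebook. As an illustration of how one of the remaining ones could be approached, the sketch below handles `[NP,he,she] is NUMBER → [NP,he,she] is XXX years old` using the `CD` tag. Everything in it is an assumption made for this example: the function name `age_clues`, the hand-tagged sentence, and the decision to accept only a pronoun or a possessive plus noun as subject.

```python
def age_clues(tokens, tags):
    """Sketch of '[NP, he, she] is NUMBER' -> ('<subject> is ... years old', NUMBER)."""
    results = []
    for i in range(1, len(tokens) - 1):
        # Look for 'is' immediately followed by a cardinal number.
        if tokens[i].lower() != 'is' or tags[i + 1] != 'CD':
            continue
        if tags[i - 1] == 'PRP':
            subject = tokens[i - 1]                       # he / she
        elif tags[i - 1].startswith('NN') and i >= 2 and tags[i - 2] == 'PRP$':
            subject = tokens[i - 2] + ' ' + tokens[i - 1]  # e.g. 'My brother'
        else:
            continue
        results.append((f"{subject} is ... years old", tokens[i + 1]))
    return results

# Hypothetical, hand-tagged sentences.
print(age_clues(['My', 'brother', 'is', 'ten', '.'],
                ['PRP$', 'NN', 'VBZ', 'CD', '.']))
print(age_clues(['She', 'is', '12', '.'],
                ['PRP', 'VBZ', 'CD', '.']))
```

This remains a heuristic: "is NUMBER" does not always express an age, so restricting the subject to people-like noun phrases only reduces, and does not remove, false positives.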


### Applying the patterns

#### Application example

Below are some examples that illustrate what we want to achieve, working with the following text:

**Text 1: My Wonderful Family**

I live in a house near the mountains. I have two brothers and one sister, and I was born last. My father teaches mathematics, and my mother is a nurse at a big hospital. My brothers are very smart and work hard in school. My sister is a nervous girl, but she is very kind. My grandmother also lives with us. She came from Italy when I was two years old. She has grown old, but she is still very strong. She cooks the best food!

My family is very important to me. We do lots of things together. My brothers and I like to go on long walks in the mountains. My sister likes to cook with my grandmother. On the weekends we all play board games together. We laugh and always have a good time. I love my family very much.

We apply the following patterns:
* NP is a [list of professions] → NP is a **XXX**.
* NP1 is a [list of professions], NP1 works at NP2 → **XXX** works at NP2, NP1 works at **XXX**.
* NP is XXX, PRONOUN is YYY → NP is **…YYY…**
* NP VERB XXX and YYY → NP VERB **…XXX…**, NP VERB **…YYY…**, not VERB in …XXX…, …YYY… 
* NP1 is called NP2 →  NP2 is **…XXX…**, NP1 is **…XXX…**
* NP1 like/likes NP2
* NP1 VERB NP2 xxx
* I live/lives in a house XXX -> My house is XXX
* NP live/lives in a house XXX -> NP’s house is XXX
* NP live/lives in XXX -> NP’s house is in XXX 

In [None]:
DATASET_PATH = "/content/drive/MyDrive/Semestre Impar 2022/Modulos PLN/ME/dataset-lingua/"
all_texts_sentences, filenames = read_dataset_files_sentences(DATASET_PATH)
texts_pos_sentences = get_pos_features_sentences(all_texts_sentences)

In [None]:
# enumerate gives the true position even when a text or sentence is repeated,
# whereas list.index() would always return the first occurrence.
for text_number, text_sentences in enumerate(texts_pos_sentences, start=1):
  print("Text " + str(text_number) + ":")
  for sent_number, pos_sent in enumerate(text_sentences):
    tokens, pos = separate_tuples(pos_sent)
    clue_list_p1, goal_list_p1 = pattern_1(tokens, pos, pos_sent)
    clue_list_p2, goal_list_p2 = pattern_2(tokens, pos, pos_sent)
    clue_list_p3, goal_list_p3 = pattern_3(tokens, pos, pos_sent)
    clue_list_p4, goal_list_p4 = pattern_4(tokens, pos, pos_sent)
    clue_list_p5, goal_list_p5 = pattern_5(tokens, pos, pos_sent)
    clue_list_p6, goal_list_p6 = pattern_6(tokens, pos, pos_sent)
    clue_list_p7, goal_list_p7 = pattern_7(tokens, pos, pos_sent)
    clue_list_p8, goal_list_p8 = pattern_8(tokens, pos, pos_sent)
    clue_list_p9, goal_list_p9 = pattern_9(tokens, pos, pos_sent)
    clue_list_p10, goal_list_p10 = pattern_10(tokens, pos, pos_sent)
    clue_list = [*clue_list_p1, *clue_list_p2, *clue_list_p3, *clue_list_p4, *clue_list_p5, *clue_list_p6, *clue_list_p7, *clue_list_p8, *clue_list_p9, *clue_list_p10]
    goal_list = [*goal_list_p1, *goal_list_p2, *goal_list_p3, *goal_list_p4, *goal_list_p5, *goal_list_p6, *goal_list_p7, *goal_list_p8, *goal_list_p9, *goal_list_p10]
    if len(clue_list) > 0 and len(goal_list) > 0:
      print(" Clues sentence " + str(sent_number) + ": " + str(clue_list))
      print(" Goals sentence " + str(sent_number) + ": " + str(goal_list))

Text 1:
 Clues sentence 0: ['My house is ...']
 Goals sentence 0: ['near the mountains ']
 Clues sentence 2: ['my mother is a ...', 'my mother works at ...', '... works at hospital']
 Goals sentence 2: ['nurse', 'hospital', 'my mother']
 Clues sentence 4: ['My sister is ...']
 Goals sentence 4: ['kind']
 Clues sentence 11: ['What My brothers likes', 'What My brothers likes']
 Goals sentence 11: ['walks', 'mountains']
 Clues sentence 12: ['What My sister likes']
 Goals sentence 12: ['grandmother']
Text 2:
 Clues sentence 7: ['My professors are ...', 'My professors are smart and ...']
 Goals sentence 7: ['smart ', 'very friendly ']
 Clues sentence 10: ['My house is ...']
 Goals sentence 10: ['on Ivy Street ']
 Clues sentence 22: ['What My Mom brings and candy when they come ']
 Goals sentence 22: ['me sweets']
Text 3:
 Clues sentence 2: ['favorite beach is ...', 'called Emerson is ...', 'favorite beach is ...', 'Emerson Beach is ...']
 Goals sentence 2: ['called Emerson', 'favorite beach

#### Running the patterns on the "eslfast" dataset

This dataset consists of texts similar to those used in the previous application example.

In [None]:
DATASET_PATH = "/content/drive/MyDrive/Semestre Impar 2022/Modulos PLN/ME/dataset-eslfast/"
all_texts_sentences, filenames = read_dataset_files_sentences(DATASET_PATH)
texts_pos_sentences = get_pos_features_sentences(all_texts_sentences)

In [None]:
# Every pattern function takes (tokens, pos, pos_sent) and returns (clues, goals).
patterns = [pattern_1, pattern_2, pattern_3, pattern_4, pattern_5,
            pattern_6, pattern_7, pattern_8, pattern_9, pattern_10]

# enumerate gives each text and sentence its real position; list.index would
# return the first match and mislabel repeated sentences.
for text_idx, text_sentences in enumerate(texts_pos_sentences, start=1):
  print("Text " + str(text_idx) + ":")
  for sent_idx, pos_sent in enumerate(text_sentences):
    tokens, pos = separate_tuples(pos_sent)
    clue_list, goal_list = [], []
    for pattern in patterns:
      clues, goals = pattern(tokens, pos, pos_sent)
      clue_list.extend(clues)
      goal_list.extend(goals)
    # Report only sentences where the patterns produced both clues and goals.
    if clue_list and goal_list:
      print(" Clues sentence " + str(sent_idx) + ": " + str(clue_list))
      print(" Goals sentence " + str(sent_idx) + ": " + str(goal_list))

Text 1:
Text 2:
 Clues sentence 9: ['What Tim likes']
 Goals sentence 9: ['gift']
Text 3:
Text 4:
Text 5:
Text 6:
Text 7:
Text 8:
Text 9:
 Clues sentence 0: ['What Jill likes']
 Goals sentence 0: ['math']
Text 10:
Text 11:
 Clues sentence 3: ['What Kate likes']
 Goals sentence 3: ['dog']
Text 12:
Text 13:
Text 14:
Text 15:
 Clues sentence 10: ['What Barbara likes']
 Goals sentence 10: ['one']
Text 16:
 Clues sentence 2: ['Her mom takes a ...']
 Goals sentence 2: ['dentist']
Text 17:
Text 18:
Text 19:
Text 20:
Text 21:
Text 22:
Text 23:
Text 24:
Text 25:
Text 26:
Text 27:
Text 28:
Text 29:
Text 30:
Text 31:
Text 32:
Text 33:
Text 34:
Text 35:
Text 36:
 Clues sentence 2: ['from Canada house is ...']
 Goals sentence 2: ['Nevada ']
Text 37:
Text 38:
Text 39:
 Clues sentence 9: ['What one likes']
 Goals sentence 9: ['flu']
Text 40:
Text 41:
Text 42:
Text 43:
Text 44:
Text 45:
Text 46:
Text 47:
Text 48:
Text 49:
Text 50:
Text 51:
Text 52:
Text 53:
Text 54:
Text 55:
Text 56:
Text 57:
Text 58:
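
In the output above, only 7 of the 58 listed texts yield any clue. A minimal sketch for computing that kind of coverage figure automatically, assuming the printed log has been captured as a string; the helper name `pattern_coverage` and the `sample_log` excerpt are illustrative and not part of the notebook:

```python
import re

def pattern_coverage(log: str):
    """Return (texts_with_clues, total_texts) for a log in the format printed above."""
    total = 0
    with_clues = set()
    current = None
    for line in log.splitlines():
        stripped = line.strip()
        header = re.fullmatch(r"Text (\d+):", stripped)
        if header:
            # A new "Text N:" header starts the next text.
            current = int(header.group(1))
            total += 1
        elif stripped.startswith("Clues sentence") and current is not None:
            # Any "Clues sentence" line marks the current text as covered.
            with_clues.add(current)
    return len(with_clues), total

# Hypothetical excerpt in the same format as the output above.
sample_log = """Text 1:
Text 2:
 Clues sentence 9: ['What Tim likes']
 Goals sentence 9: ['gift']
Text 3:
"""

hits, total = pattern_coverage(sample_log)
print(f"{hits} of {total} texts produced at least one clue")
```

Tracking this ratio per dataset would make it easier to compare how the patterns behave on "eslfast" and "readworks".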

#### Running the patterns on the "readworks" dataset

This dataset consists of somewhat more complex texts, on which the implemented patterns do not work very well.
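
One plausible reason is that fixed part-of-speech sequences stop matching as soon as modifiers or subordinate clauses appear between the expected tokens. The toy sketch below illustrates the effect. It assumes a hypothetical `is_a_pattern` function (it is not one of the notebook's `pattern_N` functions) and hand-assigned POS tags rather than tagger output:

```python
def is_a_pattern(tagged):
    """Toy pattern: '<subject> is <determiner> <noun>' with the noun right after the determiner.

    `tagged` is a list of (token, POS tag) tuples. Returns (clues, goals).
    """
    clues, goals = [], []
    for i in range(1, len(tagged) - 2):
        (verb, _), (det, det_tag), (noun, noun_tag) = tagged[i], tagged[i + 1], tagged[i + 2]
        if verb == "is" and det_tag == "DT" and noun_tag == "NN":
            # Everything before the verb is treated as the subject.
            subject = " ".join(tok for tok, _ in tagged[:i])
            clues.append(f"{subject} is {det} ...")
            goals.append(noun)
    return clues, goals

# Hand-tagged examples (assumed tags, for illustration only).
simple = [("My", "PRP$"), ("mother", "NN"), ("is", "VBZ"),
          ("a", "DT"), ("nurse", "NN"), (".", ".")]
complex_ = [("My", "PRP$"), ("mother", "NN"), (",", ","), ("who", "WP"),
            ("works", "VBZ"), ("nights", "NNS"), (",", ","), ("is", "VBZ"),
            ("a", "DT"), ("highly", "RB"), ("trained", "VBN"),
            ("nurse", "NN"), (".", ".")]

print(is_a_pattern(simple))    # (['My mother is a ...'], ['nurse'])
print(is_a_pattern(complex_))  # ([], [])
```

The second sentence carries the same definition, but the adverb and participle between the determiner and the noun break the fixed tag sequence, so nothing is extracted.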

In [None]:
DATASET_PATH = "/content/drive/MyDrive/Semestre Impar 2022/Modulos PLN/ME/dataset-readworks/"
all_texts_sentences, filenames = read_dataset_files_sentences(DATASET_PATH)
texts_pos_sentences = get_pos_features_sentences(all_texts_sentences)

In [None]:
# Every pattern function takes (tokens, pos, pos_sent) and returns (clues, goals).
patterns = [pattern_1, pattern_2, pattern_3, pattern_4, pattern_5,
            pattern_6, pattern_7, pattern_8, pattern_9, pattern_10]

# enumerate gives each text and sentence its real position; list.index would
# return the first match and mislabel repeated sentences.
for text_idx, text_sentences in enumerate(texts_pos_sentences, start=1):
  print("Text " + str(text_idx) + ":")
  for sent_idx, pos_sent in enumerate(text_sentences):
    tokens, pos = separate_tuples(pos_sent)
    clue_list, goal_list = [], []
    for pattern in patterns:
      clues, goals = pattern(tokens, pos, pos_sent)
      clue_list.extend(clues)
      goal_list.extend(goals)
    # Report only sentences where the patterns produced both clues and goals.
    if clue_list and goal_list:
      print(" Clues sentence " + str(sent_idx) + ": " + str(clue_list))
      print(" Goals sentence " + str(sent_idx) + ": " + str(goal_list))