<img src="https://github.com/hernancontigiani/ceia_memorias_especializacion/raw/master/Figures/logoFIUBA.jpg" width="500" align="center">


# Natural language processing
## Bot with NLTK using a Wikipedia corpus


In [1]:
import json
import string
import random
import re # Regular Expressions (regex)
import urllib.request

import numpy as np

# To read and parse the HTML text from Wikipedia
import bs4 as bs

import nltk
# Download the tokenizer and WordNet data
nltk.download("punkt")
nltk.download("wordnet")
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /home/rafael/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/rafael/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/rafael/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

### 1 - Data
The data comes from the English Wikipedia article on SpaceX.

In [2]:
raw_html = urllib.request.urlopen('https://en.wikipedia.org/wiki/SpaceX')
raw_html = raw_html.read()

# Parse the article; 'lxml' is the parser to use
article_html = bs.BeautifulSoup(raw_html, 'lxml')

# Find all the paragraphs in the HTML (under the <p> tag)
# and keep them available as a list
article_paragraphs = article_html.find_all('p')

article_text = ''

for para in article_paragraphs:
    article_text += para.text

article_text = article_text.lower()

In [3]:
# Let's take a look
article_text

'\nthe space exploration technologies corporation (spacex)[9] is an american spacecraft manufacturer, launcher, and satellite communications corporation headquartered in hawthorne, california. it was founded in 2002 by elon musk with the stated goal of reducing space transportation costs to enable the colonization of mars. the company manufactures the falcon 9, falcon heavy, and starship launch vehicles; several rocket engines; cargo dragon and crew dragon spacecraft; and starlink communications satellites.\nspacex offers commercial satellite-based internet service via its constellation of starlink satellites, which became the largest-ever satellite constellation in january 2020 and as of december 2022 comprised more than 3,300 small satellites in orbit.[10]\nthe company is also developing starship, a privately funded, fully reusable, super heavy-lift launch system for interplanetary and orbital spaceflight. it is intended to become spacex\'s primary orbital vehicle, supplanting the ex

In [4]:
print("Number of characters in the article:", len(article_text))

Number of characters in the article: 46048


### 2 - Preprocessing
- Remove special characters
- Remove extra spaces and line breaks

In [5]:
# Regex refresher:
# https://docs.python.org/3/library/re.html

# To practice regex:
# https://regex101.com/

# the 'r' prefix before a string marks it as a raw string
# '\n' is interpreted by Python as a newline
# r'\n' is interpreted by Python as the two-character string:
#  backslash and n

# regex substitutions, replacing each match with a single space:
text = re.sub(r'\[[0-9]*\]', ' ', article_text) # replace bracketed numbers (citation markers such as [9])
# (note that the backslashes make the brackets match literally)
text = re.sub(r'\s+', ' ', text) # collapse each run of spaces, newlines or tabs

# try the patterns above in regex101 with:
# 'Hola [1], [], [ estoy bien   [123]. [12sss]. OK!   .'
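
As a quick check of the two substitutions, the sketch below applies them to the sample string suggested above. The variable names `sample`, `no_citations` and `cleaned` are only illustrative. Note that `[0-9]*` also matches zero digits, so the empty `[]` is removed, while `[ estoy` and `[12sss]` are left alone.

```python
import re

# A raw string keeps the backslash: two characters instead of one newline
assert len('\n') == 1 and len(r'\n') == 2

sample = 'Hola [1], [], [ estoy bien   [123]. [12sss]. OK!   .'

# Step 1: replace bracketed digit runs (including the empty "[]") with a space
no_citations = re.sub(r'\[[0-9]*\]', ' ', sample)

# Step 2: collapse every run of whitespace into a single space
cleaned = re.sub(r'\s+', ' ', no_citations)

print(cleaned)  # Hola , , [ estoy bien . [12sss]. OK! .
```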

In [6]:
# Let's take a look
text

' the space exploration technologies corporation (spacex) is an american spacecraft manufacturer, launcher, and satellite communications corporation headquartered in hawthorne, california. it was founded in 2002 by elon musk with the stated goal of reducing space transportation costs to enable the colonization of mars. the company manufactures the falcon 9, falcon heavy, and starship launch vehicles; several rocket engines; cargo dragon and crew dragon spacecraft; and starlink communications satellites. spacex offers commercial satellite-based internet service via its constellation of starlink satellites, which became the largest-ever satellite constellation in january 2020 and as of december 2022 comprised more than 3,300 small satellites in orbit. the company is also developing starship, a privately funded, fully reusable, super heavy-lift launch system for interplanetary and orbital spaceflight. it is intended to become spacex\'s primary orbital vehicle, supplanting the existing fal

In [7]:
print("Number of characters in the text:", len(text))

Number of characters in the text: 44728


### 3 - Split the text into sentences and words

In [8]:
corpus = nltk.sent_tokenize(text) # split into sentences
words = nltk.word_tokenize(text) # split into tokens

In [9]:
# Let's take a look
corpus[:10]

[' the space exploration technologies corporation (spacex) is an american spacecraft manufacturer, launcher, and satellite communications corporation headquartered in hawthorne, california.',
 'it was founded in 2002 by elon musk with the stated goal of reducing space transportation costs to enable the colonization of mars.',
 'the company manufactures the falcon 9, falcon heavy, and starship launch vehicles; several rocket engines; cargo dragon and crew dragon spacecraft; and starlink communications satellites.',
 'spacex offers commercial satellite-based internet service via its constellation of starlink satellites, which became the largest-ever satellite constellation in january 2020 and as of december 2022 comprised more than 3,300 small satellites in orbit.',
 'the company is also developing starship, a privately funded, fully reusable, super heavy-lift launch system for interplanetary and orbital spaceflight.',
 "it is intended to become spacex's primary orbital vehicle, supplant

In [10]:
# Let's take a look
words[:20]

['the',
 'space',
 'exploration',
 'technologies',
 'corporation',
 '(',
 'spacex',
 ')',
 'is',
 'an',
 'american',
 'spacecraft',
 'manufacturer',
 ',',
 'launcher',
 ',',
 'and',
 'satellite',
 'communications',
 'corporation']

In [11]:
# len(words) counts every token, repeats included; it is not the number of unique terms
print("Number of tokens:", len(words))

Number of tokens: 7963


### 4 - Helper functions to clean and process the user's input
- Lemmatize the tokens of the sentence
- Remove punctuation symbols

In [12]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

def perform_lemmatization(tokens):
    return [lemmatizer.lemmatize(token) for token in tokens]

# ord() returns the Unicode code point of a given character
punctuation_removal = dict((ord(punctuation), None) for punctuation in string.punctuation)

def get_processed_text(document):
    # 1 - lowercase the text (str.lower())
    # 2 - remove punctuation symbols (str.translate())
    # 3 - tokenize (nltk.word_tokenize)
    # 4 - lemmatize (our perform_lemmatization function)
    return perform_lemmatization(nltk.word_tokenize(document.lower().translate(punctuation_removal)))
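
To see the punctuation-removal step in isolation, here is a minimal sketch that uses only the standard library. It builds the same translation table as above, but `str.split()` stands in for `nltk.word_tokenize` and lemmatization is left out, so it is a simplification rather than the notebook's actual pipeline. The function name `strip_and_split` is hypothetical.

```python
import string

# Map every punctuation code point to None so that str.translate() deletes it
punctuation_removal = {ord(ch): None for ch in string.punctuation}

def strip_and_split(document):
    # Lowercase, delete punctuation, then split on whitespace
    # (whitespace splitting is a stand-in for nltk.word_tokenize)
    return document.lower().translate(punctuation_removal).split()

tokens = strip_and_split("Hello, World! It's SpaceX's rocket.")
print(tokens)  # ['hello', 'world', 'its', 'spacexs', 'rocket']
```

Deleting the apostrophe merges "SpaceX's" into "spacexs", which will not match "spacex" in the vocabulary. This is a side effect of removing punctuation before tokenizing.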

### 5 - Use TF-IDF vectors and cosine similarity built from the Wikipedia article corpus

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def generate_response(user_input, corpus):
    response = ''
    # Append the user's question to the corpus so that it is tokenized and
    # vectorized together with the article's sentences
    corpus.append(user_input)

    # Create a TF-IDF vectorizer that removes English stop words and uses
    # our "get_processed_text" function to obtain lemmatized tokens
    word_vectorizer = TfidfVectorizer(tokenizer=get_processed_text, stop_words='english')

    # Build the vectors from the corpus
    all_word_vectors = word_vectorizer.fit_transform(corpus)

    # Compute the cosine similarity between the user input (the last vector, "-1")
    # and every document, including the input itself
    # NOTE: we will look at this similarity matrix in more detail with word embeddings
    similar_vector_values = cosine_similarity(all_word_vectors[-1], all_word_vectors)

    # Get the index of the vector closest to our sentence
    # --> position -2 skips the similarity of the input with itself
    similar_sentence_number = similar_vector_values.argsort()[0][-2]
    matched_vector = similar_vector_values.flatten()
    matched_vector.sort()
    vector_matched = matched_vector[-2]

    if vector_matched == 0: # the cosine similarity was zero (no terms in common)
        response = "I am sorry, I could not understand you"
    else:
        response = corpus[similar_sentence_number] # most similar document in the corpus

    # Remove the appended input; pop() drops the last element, whereas
    # remove() would drop the first sentence equal to the input
    corpus.pop()
    return response
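
To make the retrieval step concrete, the following is a rough pure-Python sketch of TF-IDF weighting followed by cosine similarity on a toy corpus. It is not the scikit-learn implementation: it splits on whitespace, skips stop-word removal and lemmatization, and assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` with L2 normalization. The helper names `tfidf_vectors` and `cosine`, and the toy sentences, are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term count times the smoothed IDF (assumed formula, see lead-in)
        weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                   for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

def cosine(u, v):
    # Both vectors have unit length, so the dot product equals the cosine
    return sum(w * v.get(t, 0.0) for t, w in u.items())

sentences = [
    "the rocket has nine engines",
    "the company builds satellites",
    "engines are tested on stands",
]
query = "nine engines"

# Vectorize the sentences and the query together, as generate_response does
vectors = tfidf_vectors(sentences + [query])
scores = [cosine(vectors[-1], vec) for vec in vectors[:-1]]
best = max(range(len(scores)), key=scores.__getitem__)
print(sentences[best])  # the rocket has nine engines
```

The second sentence shares no term with the query, so its score is exactly zero. That is the case `generate_response` reports as "I am sorry, I could not understand you".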

### 6 - Try out the system
The system will try to find the part of the article most closely related to our input text. Suggested inputs to try:

In [14]:
inputs = [
    'Falcon 9',
    'Falcon Heavy',
    'Grasshopper',
    'Boca Chica',
    'Merlin',
    'BFR',
    'MCT',
    'Starship',
    'Shotwell',
    'Muller',
    'RTLS',
    'Ocean',
    'ULA',
    'Department of Defense',
    'Vandenberg',
    'Falcon 1',
    'Kestrel',
    'Raptor',
    'Landing',
    'Full Thrust',
    'International Space Station',
    'Competition',
    'Boeing',
    'Russia',
    'Europe',
    'Launch',
    'Starlink',
    'Orbit'
]

In [15]:
# Gradio will be used to try out the bot
# A powerful tool for quickly building interfaces to test models
# https://gradio.app/
import sys
!{sys.executable} -m pip install gradio



In [16]:
# "user_input" avoids shadowing Python's built-in input()
for user_input in inputs:
    print(f"Input: {user_input}")
    print(f"Output: {generate_response(user_input.lower(), corpus)}")
    print()

Input: Falcon 9
Output: the vehicle was upgraded to falcon 9 v1.1 in 2013, falcon 9 full thrust in 2015, and finally to falcon 9 block 5 in 2018. the first stage of falcon 9 is designed to retropropulsively land, be recovered, and reflown.

Input: Falcon Heavy
Output: all but one (a falcon heavy in november) was on a falcon 9 rocket.

Input: Grasshopper
Output: all spacex rocket engines are tested on rocket test stands, and low-altitude vtvl flight testing of the falcon 9 grasshopper in 2012–2013 were carried out at mcgregor.

Input: Boca Chica
Output: spacex started manufacturing the first prototypes of starship in 2019 at the company's facility in boca chica, texas, later renamed starbase.

Input: Merlin
Output: it has nine merlin engines in its first stage.

Input: BFR
Output: I am sorry, I could not understand you

Input: MCT
Output: I am sorry, I could not understand you

Input: Starship
Output: these are modified oil rigs to use in the 2020s to provide a sea launch option for their second-genera

### Student

- Take one of the two example bots and build your own.
- Draw conclusions from the results.

__IMPORTANT__: For the submission, the questions and the bot's answers must remain recorded in the Colab notebook so that we can evaluate the final performance.

Conclusion: This method is simple and quick to fit, and it works well for looking up basic terms or suggestions in a text. However, the sentence it returns is often not the best one available in the article, and the bot cannot generalize: it only repeats sentences from the original text.