<img src="https://github.com/hernancontigiani/ceia_memorias_especializacion/raw/master/Figures/logoFIUBA.jpg" width="500" align="center">


# Natural language processing
## Bot with NLTK using a Wikipedia corpus (FIFA World Cup)


In [1]:
import json
import string
import random
import re
import urllib.request

import numpy as np

# To read and parse the HTML text from Wikipedia
import bs4 as bs

import nltk
# Download the NLTK resources (tokenizer, WordNet, stop words)
nltk.download("punkt")
nltk.download("wordnet")
nltk.download("stopwords")
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

### 1 - Data
The data comes from the English Wikipedia article on the FIFA World Cup.

In [2]:
raw_html = urllib.request.urlopen('https://en.wikipedia.org/wiki/FIFA_World_Cup')
raw_html = raw_html.read()

article_html = bs.BeautifulSoup(raw_html, 'lxml')

article_paragraphs = article_html.find_all('p')

article_text = ''

for para in article_paragraphs:
    article_text += para.text

# convert everything to lowercase
article_text = article_text.lower()

In [3]:
article_text

'\nthe fifa world cup, often simply called the world cup, is an international association football competition contested by the senior men\'s national teams of the members of the fédération internationale de football association (fifa), the sport\'s global governing body. the championship has been awarded every four years since the inaugural tournament in 1930, except in 1942 and 1946 when it was not held because of the second world war. the current champion is france, which won its second title at the 2018 tournament in russia.\nthe current format involves a qualification phase, which takes place over the preceding three years, to determine which teams qualify for the tournament phase. in the tournament phase, 32 teams, including the automatically qualifying host nation(s), compete for the title at venues within the host nation(s) over about a month.\nas of the 2018 fifa world cup, twenty-one final tournaments have been held and a total of 79 national teams have competed. the trophy h

In [4]:
print("Number of characters in the article:", len(article_text))

Number of characters in the article: 33869


### 2 - Preprocessing
- Remove special characters (bracketed citation markers such as `[1]`)
- Collapse extra spaces and line breaks

In [5]:
text = re.sub(r'\[[0-9]*\]', ' ', article_text)
text = re.sub(r'\s+', ' ', text)
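The effect of the two substitutions is easier to see on a short string. A minimal sketch, assuming a made-up `sample` text that is not taken from the article:

```python
import re

# Hypothetical sample with citation markers and irregular whitespace
sample = "the trophy[1] was\nredesigned[23]  in 1974."

# Replace bracketed numeric citation markers such as [1] or [23] with a space
no_citations = re.sub(r'\[[0-9]*\]', ' ', sample)

# Collapse any run of whitespace (spaces, tabs, newlines) into a single space
clean = re.sub(r'\s+', ' ', no_citations)

print(clean)
```

Because the first pattern uses `*`, it also matches empty brackets (`[]`); `[0-9]+` would restrict it to markers that contain at least one digit.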

In [6]:
text

' the fifa world cup, often simply called the world cup, is an international association football competition contested by the senior men\'s national teams of the members of the fédération internationale de football association (fifa), the sport\'s global governing body. the championship has been awarded every four years since the inaugural tournament in 1930, except in 1942 and 1946 when it was not held because of the second world war. the current champion is france, which won its second title at the 2018 tournament in russia. the current format involves a qualification phase, which takes place over the preceding three years, to determine which teams qualify for the tournament phase. in the tournament phase, 32 teams, including the automatically qualifying host nation(s), compete for the title at venues within the host nation(s) over about a month. as of the 2018 fifa world cup, twenty-one final tournaments have been held and a total of 79 national teams have competed. the trophy has 

In [7]:
print("Number of characters in the text:", len(text))

Number of characters in the text: 33360


### 3 - Split the text into sentences and words

In [8]:
corpus = nltk.sent_tokenize(text)
words = nltk.word_tokenize(text)

In [9]:
# Take a look at the first sentences
corpus[:10]

[" the fifa world cup, often simply called the world cup, is an international association football competition contested by the senior men's national teams of the members of the fédération internationale de football association (fifa), the sport's global governing body.",
 'the championship has been awarded every four years since the inaugural tournament in 1930, except in 1942 and 1946 when it was not held because of the second world war.',
 'the current champion is france, which won its second title at the 2018 tournament in russia.',
 'the current format involves a qualification phase, which takes place over the preceding three years, to determine which teams qualify for the tournament phase.',
 'in the tournament phase, 32 teams, including the automatically qualifying host nation(s), compete for the title at venues within the host nation(s) over about a month.',
 'as of the 2018 fifa world cup, twenty-one final tournaments have been held and a total of 79 national teams have compet

In [10]:
# Take a look at the first tokens
words[:20]

['the',
 'fifa',
 'world',
 'cup',
 ',',
 'often',
 'simply',
 'called',
 'the',
 'world',
 'cup',
 ',',
 'is',
 'an',
 'international',
 'association',
 'football',
 'competition',
 'contested',
 'by']

In [11]:
# len(words) counts every token, repetitions included (not the distinct vocabulary)
print("Number of tokens:", len(words))

Number of tokens: 6407
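`len(words)` counts every token, repetitions and punctuation included, so it measures the length of the text rather than the size of the vocabulary. A minimal sketch of the difference, using the first twenty tokens of the article copied by hand into a list:

```python
from collections import Counter

# First tokens of the article, copied by hand from the output of words[:20]
tokens = ['the', 'fifa', 'world', 'cup', ',', 'often', 'simply', 'called',
          'the', 'world', 'cup', ',', 'is', 'an', 'international',
          'association', 'football', 'competition', 'contested', 'by']

# Total number of tokens (what len(words) reports)
n_tokens = len(tokens)

# Vocabulary: distinct tokens only
vocabulary = set(tokens)

# Frequency of each distinct token
counts = Counter(tokens)

print("Tokens:", n_tokens)
print("Vocabulary size:", len(vocabulary))
print("Most common:", counts.most_common(3))
```

On the full article the same idea would be `len(set(words))`, which was not computed in this notebook.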


### 4 - Helper functions to clean and process the user input
- Lemmatize the tokens of the sentence
- Remove punctuation symbols

In [12]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

def perform_lemmatization(tokens):
    return [lemmatizer.lemmatize(token) for token in tokens]

punctuation_removal = dict((ord(punctuation), None) for punctuation in string.punctuation)

def get_processed_text(document):
    # 1 - lowercase the text
    # 2 - remove punctuation symbols
    # 3 - tokenize
    # 4 - lemmatize
    return perform_lemmatization(nltk.word_tokenize(document.lower().translate(punctuation_removal)))
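
The `punctuation_removal` table maps each punctuation code point to `None`, which makes `str.translate` delete those characters. A minimal sketch of that step alone, assuming a made-up input sentence and using plain `str.split` in place of the NLTK tokenizer and lemmatizer:

```python
import string

# Same table as in the notebook: every ASCII punctuation code point maps to None (delete)
punctuation_removal = dict((ord(p), None) for p in string.punctuation)

# Hypothetical user input
user_text = "Who won in 1986, Argentina?!"

# Lowercase first, then drop punctuation
stripped = user_text.lower().translate(punctuation_removal)

# Whitespace split stands in for nltk.word_tokenize in this sketch
tokens = stripped.split()

print(tokens)
```

`str.maketrans('', '', string.punctuation)` builds an equivalent table. Note that `string.punctuation` includes the apostrophe and the hyphen, so `men's` becomes `mens` and `twenty-one` becomes `twentyone` before tokenization.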

### 5 - Use TF-IDF vectors and cosine similarity built from the Wikipedia corpus

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def generate_response(user_input, corpus):
    response = ''
    # Append the user's question to the corpus so that its similarity
    # to the other documents/sentences can be computed
    corpus.append(user_input)

    # Create a TF-IDF vectorizer that removes English stop words and uses
    # our "get_processed_text" function to obtain the lemmatized tokens
    word_vectorizer = TfidfVectorizer(tokenizer=get_processed_text, stop_words='english')

    # Build the vectors from the corpus
    all_word_vectors = word_vectorizer.fit_transform(corpus)

    # Compute the cosine similarity between the appended input (the last row, "-1")
    # and every document in the corpus, the input itself included
    # NOTE: this similarity matrix is covered in more detail with word embeddings
    similar_vector_values = cosine_similarity(all_word_vectors[-1], all_word_vectors)

    # Get the index of the vector closest to our sentence
    # --> position -2 discards the similarity with our own vector
    similar_sentence_number = similar_vector_values.argsort()[0][-2]
    matched_vector = similar_vector_values.flatten()
    matched_vector.sort()
    vector_matched = matched_vector[-2]

    if vector_matched == 0:
        response = "I am sorry, I could not understand you"
    else:
        response = corpus[similar_sentence_number]
    
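    # The appended input is always the last element; pop() removes exactly it,
    # whereas remove() would delete the first equal sentence in the corpus
    corpus.pop()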
    return response
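
The matching step can be followed in plain Python. A minimal sketch, assuming whitespace tokenization, no stop-word removal or lemmatization, and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1`. The helper names `tfidf_vectors`, `cosine` and `best_match` are illustrative and belong neither to the notebook nor to scikit-learn, and the scores are not guaranteed to match `TfidfVectorizer` exactly:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (L2-normalised TF-IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Term frequency times smoothed IDF
        vec = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors

def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def best_match(query, sentences):
    """Return the sentence most similar to the query, or None if nothing overlaps."""
    # The query is vectorised together with the sentences, on a new list,
    # so the caller's list is never modified
    vectors = tfidf_vectors(sentences + [query])
    query_vec, sentence_vecs = vectors[-1], vectors[:-1]
    scores = [cosine(query_vec, v) for v in sentence_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return sentences[best] if scores[best] > 0 else None

# Hypothetical three-sentence corpus
sentences = [
    "brazil have won five times",
    "germany won in 2014",
    "the trophy was redesigned in 1974",
]
print(best_match("brazil", sentences))
print(best_match("roland garros", sentences))
```

Comparing the query only against the sentence vectors removes the need for the `[-2]` index, and a best score of zero plays the same role as the `vector_matched == 0` check.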

### 6 - Test the system
The system tries to find the sentence of the article most closely related to the input text. Suggested inputs:
- brazil
- 1986
- klose
- roland garros

In [14]:
# Gradio is used to test the bot
# It is a handy tool for building quick interfaces to try out models
# https://gradio.app/
import sys
!{sys.executable} -m pip install gradio --quiet


In [15]:
import gradio as gr

def bot_response(human_text):
    print(human_text)
    return generate_response(human_text.lower(), corpus)

iface = gr.Interface(
    fn=bot_response,
    inputs=["textbox"],
    outputs="text",
    layout="vertical")

iface.launch(debug=True)



Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://32782.gradio.app

This share link expires in 72 hours. For free permanent hosting, check out Spaces (https://huggingface.co/spaces)


brazil




1986
klose
roland garros
Keyboard interruption in main thread... closing server.


(<gradio.routes.App at 0x7fa4b2b51f50>,
 'http://127.0.0.1:7860/',
 'https://32782.gradio.app')


### 7 - Summary

This notebook analyzed the bot developed in class with the NLTK library. The input document was the Wikipedia article on the [FIFA World Cup](https://en.wikipedia.org/wiki/FIFA_World_Cup).

The first task was to preprocess the document. scikit-learn was then used to build the TF-IDF vectors and compute the cosine similarity.

The bot was tried with a few inputs, with the following results:

* *brazil*: brazil have won five times, and they are the only team to have played in every tournament.

* *1986*: argentina won a world cup in north america in 1986, while spain won in africa in 2010. in 2014, germany became the first european team to win in the americas.

* *klose*: miroslav klose of germany (2002–2014) is the all-time top scorer at the world cup with 16 goals.

* *roland garros*: I am sorry, I could not understand you