# NLP Fundamentals with Python

### Gonzalo Estrán Buyo, *Economics & AI Research.* 
Twitter: @__[sigmoider](https://twitter.com/sigmoider)__

# OUTLINE:

### 1. Introduction
### 2. Main areas
### 3. Architecture
### 4. Hands-on
### 5. State of the art
### 6. Applications

## 1. INTRODUCTION
## *What NLP is <span style="color:red">(NOT)</span>*

- **WHAT NLP IS NOT:**
    - Neuro-Linguistic Programming
    - Magic
    - The solution to everything

<img src="https://cdn-images-1.medium.com/max/1600/1*GBCgmit0r4HkTR-92PizQw.jpeg" align="right">

- **WHAT NLP IS:**
    - Natural Language Processing
    - A field of AI: Linguistics + CS (ML)
    - Key characteristics of language it has to deal with:
        * Ambiguity
        * Synonymy
        * Syntax
        * Coreference / anaphora
        * Normalization vs. information
        * Representation
        * Personality / intent / style

*Coreference example:*
<img src="https://cdn-images-1.medium.com/max/1600/0*d6NfdzNYB5tFZKi-.png" align="left">

## 2. MAIN AREAS

## *Main challenges*

### 1. Stemming and lemmatization

Reducing the different variants of a word to a single base form. Stemming does this by trimming affixes, while lemmatization maps each word to its dictionary form.

<img src="https://cdn-images-1.medium.com/max/1600/0*yTnH6cyfOK9oL-0D.png" align="middle">
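The difference between the two reductions can be sketched in plain Python. This is a toy illustration only: the suffix list and the lookup table are hypothetical and hand-written, and the stemmer is a naive suffix stripper, not the Porter algorithm or spaCy's lemmatizer.

```python
# Toy stemmer: strips a few common English suffixes (NOT the Porter algorithm).
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Toy lemmatizer: a hand-written lookup table (hypothetical, for illustration).
LEMMAS = {"was": "be", "is": "be", "better": "good", "mice": "mouse", "studies": "study"}


def naive_lemma(word):
    """Return the dictionary form if it is in the table, else the lowercased word."""
    word = word.lower()
    return LEMMAS.get(word, word)


for w in ["studies", "walking", "jumped", "mice", "was"]:
    print(f"{w:10} stem={naive_stem(w):10} lemma={naive_lemma(w)}")
```

Note how the stem of "studies" is the non-word "studi", while the lemma is the dictionary form "study": stems only need to be consistent, lemmas need to be real words.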

### 2. Coreference resolution

<img src="https://camo.githubusercontent.com/94157dbf6ab835f0608aa44d8fca92b4ae74eeec/68747470733a2f2f68756767696e67666163652e636f2f636f7265662f6173736574732f7468756d626e61696c2d6c617267652e706e67" align="middle">

### 3. Part-of-speech (POS) Tagging

Labeling each word of a corpus with a syntactic category, which can be:
* **open**: noun, verb, adjective, adverb.
* **closed**: preposition, determiner, pronoun, conjunction, auxiliary verb, particle, numeral.

<img src="https://cdn-images-1.medium.com/max/1600/1*fRjvBbgzo90x0MZdXZT82A.png" align="middle">
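The open/closed split above can be sketched as a lookup over the Universal Dependencies POS tag set. The sentence below is tagged by hand for illustration; it is not the output of a tagger, and `pos_class` is a hypothetical helper.

```python
# Universal POS tags grouped by word class (Universal Dependencies tag set).
OPEN_CLASS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV", "INTJ"}
CLOSED_CLASS = {"ADP", "AUX", "CCONJ", "SCONJ", "DET", "NUM", "PART", "PRON"}


def pos_class(tag):
    """Map a universal POS tag to 'open', 'closed', or 'other' (PUNCT, SYM, X)."""
    if tag in OPEN_CLASS:
        return "open"
    if tag in CLOSED_CLASS:
        return "closed"
    return "other"


# Hand-tagged example sentence (an assumption, not produced by a model).
tagged = [
    ("The", "DET"), ("cat", "NOUN"), ("sleeps", "VERB"),
    ("on", "ADP"), ("the", "DET"), ("sofa", "NOUN"), (".", "PUNCT"),
]

for word, tag in tagged:
    print(f"{word:8} {tag:6} {pos_class(tag)}")
```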

### 4. Dependency parsing

Extracts the dependencies (relations) between the words of a sentence as a tree. The dependencies considered are typically **subject, object, complement, and modifier** relations.
 
<img src="https://cdn-images-1.medium.com/max/1600/1*y7FVJfgBYOG6p2Je84FVXg.png" align="middle">
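As a rough sketch of the tree structure, each token can store the index of its head and the label of the relation. The parse below is annotated by hand for a made-up sentence (it is not parser output), and the helper names are hypothetical.

```python
# Each token: (text, index of its head, dependency label).
# By convention here, the root token points to itself.
tokens = [
    ("She", 1, "nsubj"),     # subject of "eats"
    ("eats", 1, "ROOT"),     # root of the sentence
    ("red", 3, "amod"),      # modifier of "apples"
    ("apples", 1, "dobj"),   # object of "eats"
]


def children(tokens, head_idx):
    """Indices of the tokens whose head is `head_idx` (excluding the root itself)."""
    return [i for i, (_, head, _) in enumerate(tokens) if head == head_idx and i != head_idx]


def render(tokens, idx, depth=0):
    """Return the subtree rooted at `idx` as indented lines."""
    text, _, dep = tokens[idx]
    lines = ["  " * depth + f"{text} ({dep})"]
    for child in children(tokens, idx):
        lines.extend(render(tokens, child, depth + 1))
    return lines


root = next(i for i, (_, head, _) in enumerate(tokens) if head == i)
tree_lines = render(tokens, root)
print("\n".join(tree_lines))
```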

### 5. Named Entity Recognition (NER)

Classifies the entities in a text into predefined categories (person, place, time, quantity, etc.).
 
<img src="https://www.depends-on-the-definition.com/wp-content/uploads/2018/12/namedentityextraction-945x468.png" align="middle">
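A minimal sketch of the idea, assuming a tiny hand-made dictionary (a gazetteer) plus one regular expression. Real NER systems rely on statistical models rather than lookups; the entity names, labels, and the `find_entities` helper are all hypothetical.

```python
import re

# Hypothetical gazetteer: a tiny hand-made dictionary of known entities.
GAZETTEER = {
    "Ana Lopez": "PERSON",
    "Spain": "PLACE",
}

# Categories with a regular surface shape can be matched with a pattern.
PATTERNS = {
    "QUANTITY": re.compile(r"\b\d+(?:\.\d+)?\b"),
}


def find_entities(text):
    """Return (start offset, surface form, label) tuples sorted by position."""
    found = []
    for name, label in GAZETTEER.items():
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((m.start(), m.group(), label))
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            found.append((m.start(), m.group(), label))
    return sorted(found)


entities = find_entities("Ana Lopez sent 2 letters from Spain.")
for start, surface, label in entities:
    print(f"{surface!r} -> {label} (offset {start})")
```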

### 6. Topic Modeling

Groups the most representative words of the corpus into topics or themes.

<img src="https://cdn-images-1.medium.com/max/800/0*3YiGp_YuwcwPNfM8.png" align="middle">

### 7. Other NLP problems

- Sentiment Analysis
- Natural Language Generation (NLG)
- Question Answering
- Text summarization
- Relation extraction
- Error Correction (grammatical, syntactic, spelling)
- Machine Translation
 
<img src="https://cdn-images-1.medium.com/max/800/0*lUAhiA6FDazmoDUx.gif" align="middle">

## 3. ARCHITECTURE
## *Putting the pieces together*

### Four main phases:
**1.** *Corpus preprocessing:* preparation.

**2.** *Structuring:* identification.

**3.** *Analysis:* extraction.

**4.** *Transformation:* executing the decision.

<img src="https://cdn-images-1.medium.com/max/1200/1*OgD8_FmQFBJ_cKst_H3WWw.jpeg" align="middle">
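One possible reading of the four phases, sketched as a chain of small functions. Everything here is an assumption made for illustration: the function names, the tiny stopword set, and the "route a message to a queue" decision are hypothetical and not taken from the diagram.

```python
import re
from collections import Counter


def preprocess(text):
    """Phase 1 (preparation): lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def structure(tokens):
    """Phase 2 (identification): flag each token as a content word or a stopword."""
    stop = {"the", "a", "is", "and", "of", "to", "it"}  # tiny illustrative set
    return [(tok, tok not in stop) for tok in tokens]


def analyze(structured):
    """Phase 3 (extraction): count how often each content word appears."""
    return Counter(tok for tok, is_content in structured if is_content)


def transform(counts, keyword="refund"):
    """Phase 4 (executing the decision): route the message based on a keyword."""
    return "billing" if counts[keyword] > 0 else "general"


message = "The refund is late and it is the second refund."
counts = analyze(structure(preprocess(message)))
decision = transform(counts)
print(counts.most_common(3), "->", decision)
```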

## 4. HANDS-ON
## *Case study*

### Data:

__[Medium Articles](https://www.kaggle.com/hsankesara/medium-articles)__: a collection of 338 articles on ML, AI, and data science scraped from Medium (3.9 MB).

### Dependencies:

- __[Pandas](https://pandas.pydata.org)__: data handling (assumed to be already installed).
- __[spaCy](https://spacy.io)__: NLP models.
- __[re](https://docs.python.org/3/library/re.html)__: regular expressions (part of the Python standard library).

<img src="https://pandas.pydata.org/_static/pandas_logo.png" width="250" align="left"> 
<br clear="all" />
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/SpaCy_logo.svg/2000px-SpaCy_logo.svg.png" width="150" align="left">

**To install spaCy with Conda, plus the English NLP model:**

In [None]:
# Install spaCy from conda-forge and download the small English model

!conda install -y -c conda-forge spacy
!python -m spacy download en_core_web_sm

**To install it with pip instead:**

In [None]:
# Install spaCy with pip and download the small English model

!pip install -U spacy
!python -m spacy download en_core_web_sm

**Imports:**

In [20]:
import pandas as pd
import re
import spacy

# Load the English model by its full name (the 'en' shortcut was removed in spaCy 3)
nlp = spacy.load('en_core_web_sm')

**Load the corpus:**

In [21]:
url = 'https://raw.githubusercontent.com/gonesbuyo/saturdays-ai/master/articles.csv'
df = pd.read_csv(url)

df.head(4)

| | author | claps | reading_time | link | title | text |
|---|---|---|---|---|---|---|
| 0 | Justin Lee | 8.3K | 11 | https://medium.com/swlh/chatbots-were-the-next... | Chatbots were the next big thing: what happene... | Oh, how the headlines blared:\nChatbots were T... |
| 1 | Conor Dewey | 1.4K | 7 | https://towardsdatascience.com/python-for-data... | Python for Data Science: 8 Concepts You May Ha... | If you’ve ever found yourself looking up the s... |
| 2 | William Koehrsen | 2.8K | 11 | https://towardsdatascience.com/automated-featu... | Automated Feature Engineering in Python – Towa... | Machine learning is increasingly moving from h... |
| 3 | Gant Laborde | 1.3K | 7 | https://medium.freecodecamp.org/machine-learni... | Machine Learning: how to go from Zero to Hero ... | If your understanding of A.I. and Machine Lear... |


In [22]:
print('Rows (articles):', df.shape[0])
print('Columns (fields):', df.shape[1], '\n')

campos = list(df.columns)  # field names
print(campos, '\n')

# Print the first article's value for every field
for c in campos:
    print(c.upper() + ':', df[c][0], '\n')

Rows (articles): 337
Columns (fields): 6 

['author', 'claps', 'reading_time', 'link', 'title', 'text'] 

AUTHOR: Justin Lee 

CLAPS: 8.3K 

READING_TIME: 11 

LINK: https://medium.com/swlh/chatbots-were-the-next-big-thing-what-happened-5fc49dd6fa61?source=---------0---------------- 

TITLE: Chatbots were the next big thing: what happened? – The Startup – Medium 

TEXT: Oh, how the headlines blared:
Chatbots were The Next Big Thing.
Our hopes were sky high. Bright-eyed and bushy-tailed, the industry was ripe for a new era of innovation: it was time to start socializing with machines.
And why wouldn’t they be? All the road signs pointed towards insane success.
At the Mobile World Congress 2017, chatbots were the main headliners. The conference organizers cited an ‘overwhelming acceptance at the event of the inevitable shift of focus for brands and corporates to chatbots’.
In fact, the only significant question around chatbots was who would monopolize the field, not whether chatbots w

**Lemmatization, POS tagging, dependencies**

In [23]:
doc = nlp('This is a sentence written in Spain that mentions Facebook and can be found at http://google.com')

# Collect one row of linguistic attributes per token
rows = [
    {
        'text': token.text,          # original token text
        'lemma': token.lemma_,       # base form (no trailing comma, which would create a tuple)
        'pos': token.pos_,           # coarse-grained part of speech
        'tag': token.tag_,           # fine-grained part of speech
        'dep': token.dep_,           # syntactic dependency label
        'shape': token.shape_,       # orthographic shape (capitalization, digits)
        'is_alpha': token.is_alpha,  # True if the token is alphabetic
        'is_stop': token.is_stop,    # True if the token is a stopword
    }
    for token in doc
]
df_token = pd.DataFrame(rows)

In [24]:
df_token

| | text | lemma | pos | tag | dep | shape | is_alpha | is_stop |
|---|---|---|---|---|---|---|---|---|
| 0 | This | this | DET | DT | nsubj | Xxxx | True | True |
| 1 | is | be | VERB | VBZ | ROOT | xx | True | True |
| 2 | a | a | DET | DT | det | x | True | True |
| 3 | sentence | sentence | NOUN | NN | nsubjpass | xxxx | True | False |
| 4 | written | write | VERB | VBN | acl | xxxx | True | False |
| 5 | in | in | ADP | IN | prep | xx | True | True |
| 6 | Spain | spain | PROPN | NNP | pobj | Xxxxx | True | False |
| 7 | that | that | ADJ | WDT | nsubj | xxxx | True | True |
| 8 | mentions | mention | VERB | VBZ | relcl | xxxx | True | False |
| 9 | Facebook | facebook | PROPN | NNP | dobj | Xxxxx | True | False |


**Dependency Parsing**

[DisplaCy](https://explosion.ai/demos/displacy)

**Regular expressions**

Regular expressions let us match character sequences that satisfy specific conditions.

To learn more: __[RegEx101](https://regex101.com/)__



In [25]:
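# URL at the start of a token: optional scheme/www, domain, TLD, optional port and path
url_pattern = re.compile(r'(^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?)')

l = []
for w in doc:
    word = w.text_with_ws  # token text plus trailing whitespace (`w.string` was removed in spaCy 3)
    word = url_pattern.sub('[URL]', word)  # replace URLs with a placeholder
    l.append(word)

print(l)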

['This ', 'is ', 'a ', 'sentence ', 'written ', 'in ', 'Spain ', 'that ', 'mentions ', 'Facebook ', 'and ', 'can ', 'be ', 'found ', 'at ', '[URL]']


**Punctuation and stopwords**

In [26]:
punctuations = ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
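Lists like the two above are typically used to filter tokens before analysis. The following is a minimal sketch under stated assumptions: `punct_set` and `stop_set` are small illustrative stand-ins (the full lists above could be swapped in), and `clean_tokens` is a hypothetical helper.

```python
import string

# Illustrative stand-ins for the full lists (assumption: a small subset is enough here).
punct_set = set(string.punctuation)
stop_set = {"this", "is", "a", "in", "that", "and", "can", "be", "at"}


def clean_tokens(tokens):
    """Drop stopwords and punctuation, ignoring case and surrounding whitespace."""
    kept = []
    for tok in tokens:
        word = tok.strip()
        if word and word.lower() not in stop_set and word not in punct_set:
            kept.append(word)
    return kept


tokens = ['This ', 'is ', 'a ', 'sentence ', 'written ', 'in ', 'Spain ', ',', 'and ', 'found ', 'at ', '[URL]']
cleaned = clean_tokens(tokens)
print(cleaned)
```

Using sets rather than lists keeps each membership check constant-time, which matters once the filter runs over a whole corpus.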


## 5. STATE OF THE ART
## *Deep NLP*

* RNNs / LSTMs
* word2vec / GloVe / fastText
* Reinforcement Learning?

## **6. APPLICATIONS**
## *Ontologies, chatbots, and others*

- **Ontologies**

<img src="https://enterprise-knowledge.com/cms/assets/uploads/2017/02/Ontology_Design.png" align="middle">

- **Chatbots**

<img src="https://cdn-images-1.medium.com/max/1600/1*IS8nFJznvfsmLRLv-cKyDw.png" align="middle">

## **HOW TO GET STARTED**

* Statistics + ML:
    * Courses: Udacity, Coursera, Udemy.
    * Articles: Medium.
    * Books: *Hands‑On Machine Learning with Scikit‑Learn and TensorFlow* (A. Géron), *Machine Learning* (Mitchell).
    
    
* Papers:
    * [paperswithcode](http://paperswithcode.com)
    * [Must-Read NLP Papers](http://masatohagiwara.net/100-nlp-papers/)

## **RESOURCES**

# Thanks!