### Examples from the Text Analytics I session
#### In this session we covered some classical text-processing techniques, as well as several text vectorization methods.
#### This short notebook contains several code snippets to experiment with.

In [1]:
import nltk
import numpy as np

In [2]:
raw_dataset = [
"En un lugar de la mancha de cuyo nombre no quiero acordarme no ha mucho tiempo que vivia...", 
"Con 100 cañones por banda viento en popa a toda vela, no corta el mar sino vuela...", 
"Volverán las oscuras golondrinas en tu balcón sus nidos a colgar, y otra vez con el ala a sus cristales..."
]

# Some simple preprocessing operations

### Tokenization

In [3]:
dataset = [nltk.word_tokenize(sentence) for sentence in raw_dataset] 

### Removing punctuation

In [4]:
dataset = [[token 
            for token in sentence if token.isalnum()]
            for sentence in dataset]

### Stop words

In [5]:
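from nltk.corpus import stopwords
# NLTK's stop-word list is lowercase, so compare against the lowercased token;
# otherwise capitalized words such as "En" or "Con" slip through the filter.
stop_words = set(stopwords.words('spanish'))
dataset = [[token 
            for token in sentence if token.lower() not in stop_words]
            for sentence in dataset]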

### Stemming

In [6]:
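from nltk.stem import PorterStemmer
# Note: the Porter algorithm was designed for English, so on Spanish text it
# only trims a few endings; NLTK also provides SnowballStemmer('spanish').
ps = PorterStemmer()
dataset = [[ps.stem(token) 
            for token in sentence]
            for sentence in dataset]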

# Vector models

### Define a custom tokenization function

In [8]:
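def tokenize(sentence):
    """Tokenize a sentence, drop stop words and punctuation, and stem the rest."""
    tokens = nltk.word_tokenize(sentence)
    stop_words = set(stopwords.words('spanish'))
    tokens2 = [token 
               for token in tokens  if token not in stop_words]
    # Keep only alphanumeric tokens, then stem them with the stemmer `ps` defined above
    stems = [ps.stem(token) for token in tokens2 if token.isalnum()]
    return stems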

### Build two vector models, one with bag of words and one with TF-IDF

In [9]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
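# token_pattern=None makes explicit that the custom tokenizer replaces the default pattern
count_vect = CountVectorizer(tokenizer=tokenize, token_pattern=None)
tfidf_vect = TfidfVectorizer(tokenizer=tokenize, token_pattern=None)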

### Bag of words 

In [20]:
import pandas as pd
counts = count_vect.fit_transform(raw_dataset) 
pd.DataFrame(counts.todense(), columns=count_vect.get_feature_names_out())

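| | 100 | acordarm | ala | balcón | banda | cañon | colgar | corta | cristal | cuyo | ... | quiero | sino | tiempo | toda | vela | vez | viento | vivia | volverán | vuela |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| 2 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |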
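
To make the counting step concrete, here is a minimal pure-Python sketch of what a bag-of-words model does conceptually. It is not scikit-learn's implementation: the token lists and the helper names `build_vocabulary` and `bag_of_words` are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

# Hypothetical, already-tokenized documents (illustrative only).
docs = [
    ["lugar", "mancha", "nombre"],
    ["viento", "popa", "vela", "vela"],
]

def build_vocabulary(tokenized_docs):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in tokenized_docs for tok in doc})
    return {tok: i for i, tok in enumerate(tokens)}

def bag_of_words(tokens, vocabulary):
    """Count vocabulary tokens; tokens outside the vocabulary are ignored."""
    counts = Counter(tok for tok in tokens if tok in vocabulary)
    return [counts[tok] for tok in vocabulary]

vocabulary = build_vocabulary(docs)
matrix = [bag_of_words(doc, vocabulary) for doc in docs]
for row in matrix:
    print(row)
```

Each row has one column per vocabulary entry, so the matrix width grows with the number of distinct tokens seen while fitting.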


### TF-IDF

In [21]:
tfidf = tfidf_vect.fit_transform(raw_dataset)
pd.DataFrame(tfidf.todense(), columns=tfidf_vect.get_feature_names_out())

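| | 100 | acordarm | ala | balcón | banda | cañon | colgar | corta | cristal | cuyo | ... | quiero | sino | tiempo | toda | vela | vez | viento | vivia | volverán | vuela |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.353553 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.353553 | ... | 0.353553 | 0.0 | 0.353553 | 0.0 | 0.0 | 0.0 | 0.0 | 0.353553 | 0.0 | 0.0 |
| 1 | 0.301511 | 0.0 | 0.0 | 0.0 | 0.301511 | 0.301511 | 0.0 | 0.301511 | 0.0 | 0.0 | ... | 0.0 | 0.301511 | 0.0 | 0.301511 | 0.301511 | 0.0 | 0.301511 | 0.0 | 0.0 | 0.301511 |
| 2 | 0.0 | 0.0 | 0.333333 | 0.333333 | 0.0 | 0.0 | 0.333333 | 0.0 | 0.333333 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.333333 | 0.0 | 0.0 | 0.333333 | 0.0 |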
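
The weights above can be reproduced by hand. The sketch below is a pure-Python approximation of the default `TfidfVectorizer` weighting (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The token lists and the helper name `tfidf_rows` are illustrative assumptions. In the notebook's output the visible non-zero entries of each row are all equal, which is consistent with every term occurring in a single sentence: in that case each entry reduces to 1/sqrt(k) for a row with k distinct terms, for example 0.353553 ≈ 1/sqrt(8).

```python
import math
from collections import Counter

# Hypothetical, already-tokenized documents (illustrative only).
docs = [
    ["lugar", "mancha", "nombre"],
    ["viento", "popa", "vela"],
    ["vela", "mar"],
]

def tfidf_rows(tokenized_docs):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalized rows."""
    n_docs = len(tokenized_docs)
    vocab = sorted({t for d in tokenized_docs for t in d})
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in tokenized_docs for t in set(d))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for d in tokenized_docs:
        tf = Counter(d)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in raw))
        rows.append([x / norm if norm else 0.0 for x in raw])
    return vocab, rows

vocab, rows = tfidf_rows(docs)
```

In this toy corpus "vela" appears in two documents, so it receives a lower weight than "viento" within the same row.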


### Apply the model to a new sentence (bag of words)

In [22]:
new_sentence="hablo de golondrinas y balcón"
pd.DataFrame(count_vect.transform([new_sentence]).todense(), columns=count_vect.get_feature_names_out())

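| | 100 | acordarm | ala | balcón | banda | cañon | colgar | corta | cristal | cuyo | ... | quiero | sino | tiempo | toda | vela | vez | viento | vivia | volverán | vuela |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |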


### Apply the model to a new sentence (TF-IDF)

In [23]:
pd.DataFrame(tfidf_vect.transform([new_sentence]).todense(), columns=tfidf_vect.get_feature_names_out())

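| | 100 | acordarm | ala | balcón | banda | cañon | colgar | corta | cristal | cuyo | ... | quiero | sino | tiempo | toda | vela | vez | viento | vivia | volverán | vuela |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.707107 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |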


### Compute the cosine similarity with the sentences in the dataset

In [24]:
from sklearn.metrics.pairwise import cosine_similarity
query = tfidf_vect.transform([new_sentence])
similarities = cosine_similarity(tfidf, query) 
similarities

array([[0.        ],
       [0.        ],
       [0.47140452]])
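
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following is a minimal pure-Python sketch of that formula; the function name `cosine` and the toy vectors are illustrative assumptions, and returning 0.0 for a zero vector is a choice made here for the sketch.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_u * norm_v)

doc = [1.0, 1.0, 0.0]     # toy document vector
query = [1.0, 0.0, 0.0]   # toy query vector sharing one component
similarity = cosine(doc, query)
print(similarity)
```

Vectors with no shared non-zero component give 0, which matches the first two rows of the result above.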

### Retrieve the most similar sentence

In [25]:
raw_dataset[similarities.argmax()]

'Volverán las oscuras golondrinas en tu balcón sus nidos a colgar, y otra vez con el ala a sus cristales...'

### Use the "hashing trick" to control the dimension of the feature vector

In [26]:
from sklearn.feature_extraction.text import HashingVectorizer 
dim_feature_vector = 200
hv = HashingVectorizer(tokenizer=tokenize,n_features=dim_feature_vector,token_pattern=None)
hashing = hv.fit_transform(raw_dataset)
pd.DataFrame(hashing.todense())

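| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.301511 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | -0.333333 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |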
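
The negative entry in the output comes from the signed hash: by default `HashingVectorizer` uses `alternate_sign=True`, so each token adds +1 or -1 to its bucket before L2 normalization. The sketch below illustrates the idea in pure Python. It is only an approximation under stated assumptions: `zlib.crc32` stands in for the MurmurHash3 function that scikit-learn actually uses, and the helper name `hashed_vector` is hypothetical, so bucket positions and signs will not match the notebook's output.

```python
import math
import zlib

def hashed_vector(tokens, n_features):
    """Map tokens into a fixed-size vector without storing a vocabulary."""
    vec = [0.0] * n_features
    for tok in tokens:
        h = zlib.crc32(tok.encode("utf-8"))          # stand-in hash (assumption)
        index = h % n_features                       # bucket for this token
        sign = 1.0 if (h >> 31) & 1 == 0 else -1.0   # sign taken from the top bit
        vec[index] += sign
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

tokens = ["golondrina", "balcón", "hablo"]
vector = hashed_vector(tokens, 200)
print([(i, round(x, 6)) for i, x in enumerate(vector) if x != 0.0])
```

Because no vocabulary is stored, the vector length is fixed by `n_features` alone, and any token, seen before or not, is assigned to some bucket.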


### Compute the similarities in this model

In [27]:
query = hv.transform([new_sentence])
similarities = cosine_similarity(hashing, query)
similarities

array([[0.        ],
       [0.        ],
       [0.38490018]])
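
The similarity is lower here than with TF-IDF (0.385 versus 0.471). One plausible reading of the numbers, assuming no hash collisions among the tokens involved: the TF-IDF model drops "hablo" because it is not in the fitted vocabulary, whereas the hashing model has no vocabulary and keeps it, so the query's weight is spread over three tokens instead of two. The token counts below are inferred from the outputs above (0.333333 = 1/sqrt(9), 0.707107 = 1/sqrt(2)) rather than stated in the notebook.

```python
import math

shared_tokens = 2                       # "golondrina" and "balcón" (assumed stems)
doc_weight = 1 / math.sqrt(9)           # third sentence: 9 equally weighted tokens

tfidf_query_weight = 1 / math.sqrt(2)   # query keeps 2 in-vocabulary tokens
hashing_query_weight = 1 / math.sqrt(3) # query keeps all 3 non-stop-word tokens

tfidf_sim = shared_tokens * doc_weight * tfidf_query_weight
hashing_sim = shared_tokens * doc_weight * hashing_query_weight
print(round(tfidf_sim, 8), round(hashing_sim, 8))
```

Both values agree with the arrays returned by `cosine_similarity` in the two models.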

### In this case it is very important to tune the dimension of the vector: if it is too small, distinct tokens collide in the same position. Suggestion: try different values of dim_feature_vector
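
As a starting point for that experiment, the sketch below counts how many distinct buckets a small vocabulary occupies for several vector sizes. It is a pure-Python illustration under assumptions: `zlib.crc32` is a stand-in for the hash used by `HashingVectorizer`, and the vocabulary and the helper name `used_buckets` are made up for the example.

```python
import zlib

# Hypothetical vocabulary of 30 distinct tokens (illustrative only).
vocabulary = [f"token{i}" for i in range(30)]

def used_buckets(tokens, n_features):
    """Number of distinct positions the tokens are hashed into."""
    return len({zlib.crc32(t.encode("utf-8")) % n_features for t in tokens})

for n_features in (1, 10, 50, 200, 1000):
    buckets = used_buckets(vocabulary, n_features)
    collisions = len(vocabulary) - buckets
    print(f"n_features={n_features:5d}  buckets={buckets:3d}  collisions={collisions:3d}")
```

The same loop can be repeated with the notebook's `HashingVectorizer` by varying `dim_feature_vector` and comparing the resulting similarities.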