## An example of comparing embeddings

Look how nice LaTeX looks:

$$
\begin{aligned}
&\min_{Q} \; D_{\text{KL}}(Q_{t-1}\,\|\,Q_{t}) - |Q| \\
&D_{\text{KL}}(Q_{t-1}\,\|\,Q_{t}) = \sum_{i} Q_{t-1}(i) \ln\left(\frac{Q_{t-1}(i)}{Q_{t}(i)}\right)
\end{aligned}
$$
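
The KL divergence term in the formula above can be computed directly for small discrete distributions. The following is only a minimal sketch using the standard definition D_KL(p ‖ q) = Σ p(i) ln(p(i) / q(i)); the function name `kl_divergence` and the two sample distributions are made up for illustration.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p(i) * ln(p(i) / q(i)) for discrete distributions."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue  # terms with p(i) = 0 contribute nothing by convention
        if q_i == 0:
            return math.inf  # p puts mass where q has none
        total += p_i * math.log(p_i / q_i)
    return total

# Hypothetical distributions at two consecutive steps
q_prev = [0.5, 0.3, 0.2]
q_curr = [0.4, 0.4, 0.2]

print(kl_divergence(q_prev, q_prev))  # identical distributions give 0.0
print(kl_divergence(q_prev, q_curr))
print(kl_divergence(q_curr, q_prev))  # differs from the line above: KL is not symmetric
```

Because the divergence is not symmetric, the order of the two arguments matters when reading the formula.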

### Installing and importing what we need

In [1]:
# install the required libraries
!python -m spacy download es_core_news_md
!pip install unidecode


    Linking successful
    /usr/local/lib/python3.6/dist-packages/es_core_news_md -->
    /usr/local/lib/python3.6/dist-packages/spacy/data/es_core_news_md

    You can now load the model via spacy.load('es_core_news_md')



In [0]:
# If we wanted to use other embeddings:
# https://github.com/uchile-nlp/spanish-word-embeddings
#!wget -nc http://dcc.uchile.cl/~jperez/word-embeddings/glove-sbwc.i25.vec.gz
#!python -m spacy init-model es /mindspace --vectors-loc glove-sbwc.i25.vec.gz

In [3]:
# imports
import numpy as np
import pandas as pd
import unidecode
import spacy
import nltk
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

#### Defining a couple of examples

In [0]:
# examples
musico = "el video muestra una escena en una playa, se escucha la gente riendo y jugando. El mar está tranquilo, no se escuchan muchas olas, y el cielo despejado. Hay una pareja con un chico, que se alejó para jugar con una pelotita. Hay un grupo de adolescentes lejos hacia la derecha donde uno toca la guitarra."
fotografo = "Se ve un día soleado de verano en una playa, con vistas al horizonte. Hay una pareja acostada sobre una manta verde y violeta de figuras geométricas, y varias sombrillas celestes a la izquierda. A lo lejos se ve el mar, con agua bastante clara. El hombre de la pareja tiene un bigote y aparenta unos 45 años, la mujer está de espaldas."

In [0]:
# Load the Spanish language model
nlp = spacy.load('es_core_news_md')

In [0]:
# run the strings through the pipeline to get spaCy Doc objects
musico = nlp(musico)
fotografo = nlp(fotografo)

In [7]:
# This object already carries many useful attributes. This is how we access the first token of each text
musico[0], fotografo[0]

(el, Se)

In [0]:
# Compute the cosine similarity between two vectors
def similitud_coseno(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

In [9]:
# musico.vector holds the embedding of the whole example (by default, the average of its token vectors)
similitud_coseno(musico.vector, fotografo.vector)

0.9927855
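
A score this close to 1 is less surprising than it looks. spaCy's `Doc.vector` defaults to the average of the token vectors, so two texts that share many frequent words end up with similar averages even when their content words differ. The toy sketch below illustrates this with made-up 2-d vectors in pure Python; none of the values come from the model, and `cosine` and `average` are illustrative helpers.

```python
import math

def cosine(x, y):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def average(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

# Hypothetical embeddings: one frequent function word and two content words
common = [1.0, 1.0]   # stands in for words like "el", "una", "se"
sound = [1.0, 0.0]
vision = [0.0, 1.0]

# The content words alone are orthogonal
print(cosine(sound, vision))

# Each "document" is four common words plus one content word
doc_a = average([common] * 4 + [sound])
doc_b = average([common] * 4 + [vision])
print(cosine(doc_a, doc_b))  # close to 1 despite the different content words
```

This is one motivation for removing stop words before comparing texts.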

In [0]:
# load a list of words that carry little meaning (articles, etc.)
STOP_WORDS = set(stopwords.words('spanish'))

In [0]:
# now build a cleaner version of each example: drop stop words, lemmatize (e.g. "corríamos" -> "correr"), lowercase, and strip accents
musico_clean = nlp(' '.join(unidecode.unidecode(w.lemma_.lower()) for w in musico if w.lemma_.lower() not in STOP_WORDS))
fotografo_clean = nlp(' '.join(unidecode.unidecode(w.lemma_.lower()) for w in fotografo if w.lemma_.lower() not in STOP_WORDS))
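
To see the cleaning steps in isolation, here is a minimal sketch on plain strings. It makes several substitutions: the standard library's `unicodedata` stands in for `unidecode`, a tiny hand-written stop-word set stands in for NLTK's list, and lemmatization is skipped because it needs the spaCy model. `strip_accents`, `clean` and `TOY_STOP_WORDS` are illustrative names.

```python
import unicodedata

# Hand-written stand-in for the NLTK Spanish stop-word list
TOY_STOP_WORDS = {"el", "la", "un", "una", "en", "se", "y", "de"}

def strip_accents(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def clean(text):
    """Lowercase, drop stop words, then strip accents."""
    words = [w.strip(".,;:").lower() for w in text.split()]
    kept = [w for w in words if w and w not in TOY_STOP_WORDS]
    return " ".join(strip_accents(w) for w in kept)

print(clean("El mar está tranquilo y se escucha la música."))
# → mar esta tranquilo escucha musica
```

The stop-word filter runs before the accents are removed because stop-word lists for Spanish usually contain accented forms.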

In [0]:
# reference word lists for sound and vision
sonido = ['música','sonido','melodía','armonía','grave','agudo']
vision = ['fotografía','imagen','color','forma','contorno','linea']

In [13]:
# examples
o1, o2 = 'escuchar sonido violín orquesta', 'miraba reflejo color luna oscuro'
e1, e2 = nlp(o1).vector, nlp(o2).vector

# reference vectors
r1, r2 = nlp(' '.join(sonido)).vector, nlp(' '.join(vision)).vector

# cosine similarities (higher means closer)
print('sim(e1, r1)', similitud_coseno(e1, r1))
print('sim(e2, r1)', similitud_coseno(e2, r1))
print('sim(e1, r2)', similitud_coseno(e1, r2))
print('sim(e2, r2)', similitud_coseno(e2, r2))

sim(e1, r1) 0.855635
sim(e2, r1) 0.75515664
sim(e1, r2) 0.6642258
sim(e2, r2) 0.86068857
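
The four scores printed above show each example scoring higher against the reference vector of its own modality: the sound example against `r1`, the vision example against `r2`. A small sketch of turning such scores into a label follows, using the values printed above rounded to three decimals; the `classify` helper and the label names are illustrative.

```python
# Cosine similarities copied from the output above, rounded to 3 decimals
scores = {
    "e1": {"sonido": 0.856, "vision": 0.664},
    "e2": {"sonido": 0.755, "vision": 0.861},
}

def classify(similarities):
    """Return the reference label with the highest similarity."""
    return max(similarities, key=similarities.get)

for example, sims in scores.items():
    print(example, "->", classify(sims))
# e1 -> sonido
# e2 -> vision
```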


What are we doing?

spaCy's `nlp` object gives us easy access to many attributes of a text.
We create a document by passing it the string to analyze:
`var = nlp('palabra')`
Now `var.vector` holds its embedding, and each token exposes further attributes, such as `var[0].lemma_`, its normalized (lemmatized) form.


In [14]:
# apparently whoever trained these vectors did not strip accents, so 'música' and 'musica' are treated as different words
nlp('música').similarity(nlp('musica'))

0.6437335591281158

In [0]:
len(nlp.vocab)

### From here on, don't worry about this for now

In [0]:
#!pip install "dask[complete]"

In [0]:
# collect the vector of every lexeme in the vocabulary
vecs = [w.vector for w in nlp.vocab]

In [0]:
full = pd.DataFrame(vecs, index = [w.text for w in nlp.vocab])

In [0]:
del vecs

In [0]:
# drop rows that are entirely zero, i.e. words without a vector
full = full[(full.T != 0).any()]

In [20]:
full.sample(5)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cruzados | 0.189266 | 0.769406 | -0.317511 | -0.143724 | 0.904029 | -0.798096 | 0.134045 | 0.39049 | 0.200463 | 1.155759 | ... | 0.175721 | 0.784642 | -0.540075 | -0.783384 | 1.106435 | -0.485407 | -0.594081 | -1.450064 | 0.526385 | -0.6365 |
| EXELENTE | -0.353908 | -0.426671 | -0.388342 | -0.417906 | -0.554919 | -0.182105 | -0.061681 | 0.127385 | -0.30033 | -0.016929 | ... | -0.522508 | -0.143722 | -0.029164 | 0.100764 | 0.302991 | -0.144569 | -0.208229 | 0.053257 | -0.338351 | 0.040218 |
| arrebatos | -0.361051 | -0.582987 | 0.232902 | -0.243333 | -0.105624 | -0.157502 | -0.436084 | 0.227819 | -0.39431 | 1.630682 | ... | 0.426599 | 0.566566 | 0.260469 | -0.245371 | 0.077383 | -0.606782 | -0.212723 | 0.233458 | 0.258994 | -0.344367 |
| conveni | 0.069903 | -0.151023 | 0.034627 | 0.029339 | -0.011479 | -0.154196 | 0.071619 | 0.020375 | 0.018038 | -0.079844 | ... | 0.09331 | 0.085104 | 0.02731 | 0.034813 | -0.018784 | 0.051851 | 0.067418 | 0.07383 | 0.017438 | 0.024746 |
| sumas | -0.058917 | 0.245569 | 0.860402 | 0.359576 | -0.018996 | 0.406509 | 0.057397 | 0.234608 | 0.387314 | 1.882302 | ... | 1.42587 | -0.225105 | -0.455502 | 0.491791 | 0.701733 | -0.541119 | 0.510187 | -2.056892 | 0.82682 | 0.269376 |


In [21]:
# https://github.com/lmcinnes/umap
!pip install umap-learn



In [0]:
import umap
from sklearn.decomposition import PCA 
# scikit-learn's t-SNE takes too long

In [23]:
%%time
pca = PCA(n_components=3,
            random_state=440)

embedding1 = pca.fit_transform(full.values)
embedding_df1 = pd.DataFrame(embedding1, columns=['x1', 'y1', 'z1'])

CPU times: user 6.12 s, sys: 2.18 s, total: 8.31 s
Wall time: 4.6 s


In [24]:
embedding_df1.sample(5)

| | x1 | y1 | z1 |
|---|---|---|---|
| 177829 | -0.674656 | 0.569644 | -0.919855 |
| 414659 | 1.25577 | -0.531006 | -1.610488 |
| 381274 | 0.717581 | -0.748439 | -1.423003 |
| 268416 | -1.074318 | -0.304286 | -0.033311 |
| 180122 | 2.051577 | 0.829329 | -0.803018 |


In [0]:
%%time
# see https://github.com/lmcinnes/umap/blob/master/notebooks/UMAP usage and parameters.ipynb

umap_ = umap.UMAP(n_components=3)
embedding2 = umap_.fit_transform(full.values)
embedding_df2 = pd.DataFrame(embedding2, columns = ['x2', 'y2', 'z2'])

In [0]:
# pd.concat takes a list of frames; axis=1 places the PCA and UMAP columns side by side
embedding_df = pd.concat([embedding_df1, embedding_df2], axis=1)
embedding_df.index = full.index

In [0]:
embedding_df.sample(5)

In [0]:
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

In [0]:
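# assumption: plot_mode was never defined in the notebook; 'markers' draws one point per word
plot_mode = 'markers'

scatter = go.Scatter3d(
    name='PCA',  # the trace name must be a single string; 'PCA' is an illustrative label
    x=embedding_df['x1'],
    y=embedding_df['y1'],
    z=embedding_df['z1'],
    text=embedding_df.index,
    textposition='middle center',
    showlegend=False,
    mode=plot_mode,
    marker=dict(
        size=3,
        color='#ED9C69',
        symbol='circle'
    )
)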


In [0]:
# iplot expects a figure or a list of traces
iplot([scatter])