![agents](images/header.jpg)
# Semantic analysis
### Ramón Soto C. [(rsotoc@moviquest.com)](mailto:rsotoc@moviquest.com/)
[view on nbviewer](http://nbviewer.ipython.org/github/rsotoc/nlp/blob/master/6.%20Modelado%20de%20temas%3A%20DLA.ipynb)

![ ](images/blank.png)
## Definition

According to the [DRAE](http://dle.rae.es/?id=XVRDns5), the dictionary of the Real Academia Española, semantics is the...

> Discipline that studies the meaning of linguistic units and of their combinations.

Semantic analysis examines the meaning of a set of words, symbols and phrases in a specific context in order to determine the message conveyed by the text.

![](images/nlp02e.png)

In natural language processing, the semantic analysis stage takes the stream of tokens produced by lexical analysis, possibly categorized in earlier steps, and generates an interpretation of the text.

However, interpreting freely written text is an extremely hard problem, even when the text is well formed. Consider, for example, the following classic line by *Groucho Marx* (*Animal Crackers*, 1930):

> *One morning I shot an elephant in my pajamas. <br>
> How he got in my pajamas, I don't know.* ![](images/groucho.jpg)

Although the situation is played as a joke, and Groucho Marx deliberately forces the unlikely reading, the structure is syntactically correct: the phrase *in my pajamas* can attach either to the shooting or to the elephant. The "logical" interpretation is obtained by bringing in knowledge that is not available in the text, in particular: 1) an elephant does not fit in human pajamas, and 2) the author is joking.
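The ambiguity can be made concrete with a small parse counter. The sketch below is only an illustration: the toy grammar, the `GRAMMAR` and `SENTENCE` names, and the two helper functions are assumptions introduced here, not part of the original notebook. It counts how many distinct parse trees the grammar assigns to the first sentence of the joke.

```python
from functools import lru_cache

# Hypothetical toy grammar: each non-terminal maps to its possible right-hand sides.
# Any symbol that is not a key of GRAMMAR is treated as a terminal word.
GRAMMAR = {
    "S": [("NP", "VP")],
    "PP": [("P", "NP")],
    "NP": [("Det", "N"), ("Det", "N", "PP"), ("I",)],
    "VP": [("V", "NP"), ("VP", "PP")],
    "Det": [("an",), ("my",)],
    "N": [("elephant",), ("pajamas",)],
    "V": [("shot",)],
    "P": [("in",)],
}
SENTENCE = ("I", "shot", "an", "elephant", "in", "my", "pajamas")


@lru_cache(maxsize=None)
def count_parses(symbol, start, end):
    """Number of distinct parse trees deriving SENTENCE[start:end] from symbol."""
    if symbol not in GRAMMAR:  # terminal: must match exactly one word
        return int(end - start == 1 and SENTENCE[start] == symbol)
    return sum(count_sequence(rhs, start, end) for rhs in GRAMMAR[symbol])


@lru_cache(maxsize=None)
def count_sequence(rhs, start, end):
    """Ways to split SENTENCE[start:end] into non-empty parts, one per symbol in rhs."""
    if len(rhs) == 1:
        return count_parses(rhs[0], start, end)
    total = 0
    # Leave at least one word for each of the remaining symbols
    for mid in range(start + 1, end - len(rhs) + 2):
        left = count_parses(rhs[0], start, mid)
        if left:
            total += left * count_sequence(rhs[1:], mid, end)
    return total


n_parses = count_parses("S", 0, len(SENTENCE))
print("Number of parse trees:", n_parses)
```

Under this grammar the sentence has two parse trees, one per attachment of *in my pajamas*. Syntax alone cannot choose between them, which is why the world knowledge mentioned above is needed.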

Automated semantic analysis of natural language is limited in scope to specific tasks, most notably:

* Translation systems
* Question answering systems
* Summarization systems
* Spelling correction systems
* Topic identification
* Sentiment analysis

The last two are currently the most active.

## Topic modeling

Topic modeling is the task of identifying topics in a collection of documents (a corpus). Usually, all that is known about a corpus is the distribution of the words used and, possibly, the number of classes/topics to be identified. Topics are identified with specialized unsupervised learning (*clustering*) techniques. The resulting models are usually defined as probability distributions over the set of words used in the corpus.

There are several approaches to topic modeling. The most prominent are **PLSI** (*probabilistic latent semantic indexing*), **NNMF** (*non-negative matrix factorization*) and **LDA** (*latent Dirichlet allocation*).


### Non-negative matrix factorization (NNMF/NMF)

A very common way to represent a collection of documents is a **document-term matrix**. In a document-term matrix, rows represent documents and columns correspond to terms. Cells usually hold a measure of the importance of each term in a given document, for example tf-idf values. One difficulty with this representation is the number of attributes involved: one column per distinct term in the corpus.
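As a minimal sketch of such a matrix, the following builds a tf-idf document-term matrix by hand with NumPy. The toy corpus and all variable names are hypothetical, and the weighting used here (raw count times `log(N / df)`) is only one of several tf-idf variants; libraries often apply smoothing or normalization on top of it.

```python
import numpy as np

# Hypothetical toy corpus, tokenized by whitespace
docs = [
    "batman robin gotham batman",
    "superman metropolis krypton",
    "batman superman justice_league",
]

# Vocabulary: one column per distinct term, in sorted order
vocab = sorted({term for doc in docs for term in doc.split()})
col = {term: j for j, term in enumerate(vocab)}

# Raw counts: rows are documents, columns are terms
counts = np.zeros((len(docs), len(vocab)))
for i, doc in enumerate(docs):
    for term in doc.split():
        counts[i, col[term]] += 1

# Assumed weighting: tf = raw count, idf = log(N / df)
df = np.count_nonzero(counts, axis=0)  # number of documents containing each term
idf = np.log(len(docs) / df)
tfidf = counts * idf

print(vocab)
print(np.round(tfidf, 2))
```

Even this three-document corpus needs seven columns, which hints at how wide the matrix becomes for a real collection.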

The goal of NMF is to factor the original matrix $\mathbf{V}$ into two lower-rank matrices $\mathbf{W}$ and $\mathbf{H}$. Because the problem has no exact solution in the general case, the usual practice is to compute an approximate solution:

![](images/nmf.png)

The factorization generates a reduced set of new features from the original document-term matrix.
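One common way to state the approximation is as a least-squares problem with non-negativity constraints. The dimension symbols $n$ (documents), $m$ (terms) and $k$ (topics) are introduced here for illustration, and the Frobenius norm is an assumed choice of loss; other loss functions are also used.

$$
\mathbf{V}_{n \times m} \approx \mathbf{W}_{n \times k}\,\mathbf{H}_{k \times m},
\qquad
\min_{\mathbf{W} \ge 0,\ \mathbf{H} \ge 0} \lVert \mathbf{V} - \mathbf{W}\mathbf{H} \rVert_F^2,
\qquad
k \ll \min(n, m)
$$

With this orientation, each row of $\mathbf{W}$ describes a document in terms of the $k$ topics, and each row of $\mathbf{H}$ describes a topic in terms of the $m$ terms.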



As the name suggests, matrix factorization means factoring a matrix: finding two (or more) matrices whose product reproduces the original matrix.

NMF takes a document-term matrix as input and generates a set of topics, each a weighted set of co-occurring terms. The discovered topics form a basis that gives an efficient representation of the original documents.

Put another way, NMF breaks the original feature matrix $\mathbf{V}$ into the product of two lower-rank matrices $\mathbf{W}$ and $\mathbf{H}$. It iteratively modifies initial values of $\mathbf{W}$ and $\mathbf{H}$ so that their product approaches $\mathbf{V}$, and terminates when the approximation error converges or a user-defined number of iterations is reached.
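The iterative scheme can be sketched with the multiplicative update rules of Lee and Seung for the Frobenius loss. This is one common choice, used here as an assumption; it is not necessarily the solver that scikit-learn's `NMF` runs. The function name `nmf_multiplicative`, its parameters, and the toy matrix are hypothetical.

```python
import numpy as np


def nmf_multiplicative(V, k, max_iter=500, tol=1e-6, eps=1e-9, seed=0):
    """Approximate a non-negative n x m matrix V as W @ H (W: n x k, H: k x m)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k))  # random non-negative initial values
    H = rng.random((k, m))
    errors = [np.linalg.norm(V - W @ H)]  # error of the random initialization
    for _ in range(max_iter):
        # Multiplicative updates keep every entry non-negative
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(V - W @ H))
        # Stop once the approximation error no longer improves noticeably
        if errors[-2] - errors[-1] < tol:
            break
    return W, H, errors


# Hypothetical 4 x 5 document-term count matrix with two groups of terms
V = np.array([
    [3, 2, 0, 0, 1],
    [4, 3, 0, 0, 0],
    [0, 0, 5, 2, 1],
    [0, 0, 3, 4, 0],
], dtype=float)

W, H, errors = nmf_multiplicative(V, k=2)
print("initial error:", round(errors[0], 3), "final error:", round(errors[-1], 3))
```

The loop stops on either of the two conditions described above: the error has converged (`tol`) or the iteration budget (`max_iter`) is exhausted.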



In [39]:
from IPython.display import display
import pandas as pd
import numpy as np 
pd.options.display.max_colwidth = 150 

import nltk
import re

In [38]:
import json

file = 'Data Sets/Comics/lexicon_comics.json'
with open(file) as comics_file:
    dict_comics = json.load(comics_file)

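# reindex_axis is no longer available in recent pandas releases;
# reindex(columns=...) selects and orders the same columns
comicsDf = pd.DataFrame.from_dict(dict_comics).reindex(
    columns=['name', "description", "main_words", "new_description"])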

display(comicsDf.head(5))

| | name | description | main_words | new_description |
|---|---|---|---|---|
| 0 | 'Mazing Man | mazing man is the title character of a comic book series created by bob rozakis and stephen destefano and published by dc comics the series ran fo... | [man, title, character, comic, book, series, created, bob, rozakis, stephen, published, dc, comics, series, ran, twelve, issues, additional, speci... | mazing_man man title_character comic_book series_created bob_rozakis stephen published dc_comics series ran twelve_issues additional_special_issue... |
| 1 | 711 (Quality Comics) | is a fictional superhero from the golden age of comics he was created by george brenner and published by quality comics first appeared in police c... | [fictional, superhero, golden, age, comics, created, george, published, quality, comics, first, appeared, police, comics, august, lasted, january,... | 711_quality_comics fictional superhero golden_age comics_created george_published quality_comics first_appeared police_comics_august lasted januar... |
| 2 | Abigail Brand | special agent special agent abigail brand is a fictional character appearing in american comic book s published by marvel comics publication histo... | [special, agent, special, agent, abigail, brand, fictional, character, appearing, american, comic, book, published, marvel, comics, publication, h... | abigail_brand special agent special agent abigail_brand fictional character_appearing american_comic_book published marvel_comics publication hist... |
| 3 | Abin Sur | abin sur is a fictional character and a superhero from the dc comics dc universe he was a member of the green lantern corps and is best known as t... | [abin, sur, fictional, character, superhero, dc, comics, dc, universe, member, green, lantern, corps, best, known, predecessor, green, lantern, ha... | abin_sur abin sur fictional character_superhero dc_comics dc_universe member green_lantern corps best_known predecessor green_lantern hal_jordan a... |
| 4 | Abner Jenkins | abner ronald jenkins formerly known as the beetle comics beetle mach iv mach v mach vii and currently known as mach x and is a fictional character... | [abner, ronald, jenkins, formerly, known, beetle, comics, beetle, mach, mach, mach, vii, currently, known, mach, x, fictional, character, appearin... | abner_jenkins abner ronald jenkins formerly_known beetle_comics beetle mach mach mach_vii currently known mach x_fictional character_appearing ame... |


In [3]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(comicsDf.new_description)
X_array = X.toarray()
# get_feature_names() was replaced by get_feature_names_out() (scikit-learn >= 1.0)
X_vocab = np.array(vectorizer.get_feature_names_out())

In [4]:
#print(X.shape, "\n", X_array[2, 0:100], "\n", X_vocab[0:100])
# Print (column index, count, term) for terms occurring more than 5 times in document 2
for i, x in enumerate(X_array[2]):
    if x > 5:
        print(i, x, X_vocab[i])

57 6 abigail
59 14 abigail_brand
4444 7 appears
5633 8 astonishing_x_men
7722 10 beast
10584 46 brand
10651 12 breakworld
23495 10 earth
50426 6 mccoy
53323 6 mutant
58203 7 ord
66824 6 revealed
78092 25 sword
89478 11 x_men


In [19]:
from sklearn import decomposition

num_topics = 6
num_top_words = 100
clf = decomposition.NMF(n_components=num_topics, random_state=1)
doctopic = clf.fit_transform(X)

In [20]:
topic_words = []
for topic in clf.components_:
    word_idx = np.argsort(topic)[::-1][0:num_top_words]
    topic_words.append([X_vocab[i] for i in word_idx])

In [21]:
for row in topic_words:
    print(row, "\n")

['batman', 'nightwing', 'robin', 'damian', 'batgirl', 'dick', 'joker', 'dick_grayson', 'character', 'anarky', 'bruce', 'dc_comics', 'catwoman', 'jason', 'series', 'barbara_gordon', 'appears', 'gotham', 'justice_league', 'also', 'father', 'terry', 'bruce_wayne', 'new', 'death', 'selina', 'role', 'grayson', 'batman_robin', 'oracle', 'todd', 'batwoman', 'voiced', 'following', 'teen_titans', 'one', 'however', 'son', 'gotham_city', 'costume', 'version', 'time', 'two', 'later', 'story', 'red_hood', 'jason_todd', 'dc', 'revealed', 'birds', 'detective_comics', 'although', 'order', 'end', 'identity', 'tim', 'nightwing_vol', 'damian_wayne', 'killed', 'comics', 'game', 'would', 'tim_drake', 'part', 'episode', 'made', 'harley_quinn', 'kill', 'events', 'return', 'returns', 'prey', 'barbara', 'daughter', 'battle', 'family', 'file', 'talia', 'batman_beyond', 'life', 'team', 'become', 'takes', 'red_robin', 'film', 'green_arrow', 'comic_book', 'continuity', 'left', 'thumb', 'circus', 'first', 'villain'
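The top-word selection relies on `np.argsort`, which returns indices in ascending order of weight, so the result is reversed before slicing. A minimal sketch on a hypothetical topic-term matrix (the `demo_` names and the weights are made up for illustration):

```python
import numpy as np

# Hypothetical vocabulary and topic-term weights (2 topics x 5 terms)
demo_vocab = np.array(["batman", "gotham", "krypton", "robin", "superman"])
demo_topics = np.array([
    [0.9, 0.6, 0.0, 0.7, 0.1],
    [0.1, 0.0, 0.8, 0.0, 0.9],
])
demo_top_n = 3

demo_topic_words = []
for weights in demo_topics:
    # argsort is ascending, so reverse it to put the largest weights first
    top_idx = np.argsort(weights)[::-1][:demo_top_n]
    demo_topic_words.append(demo_vocab[top_idx].tolist())

for words in demo_topic_words:
    print(words)
```

Each printed list ranks the terms of one topic from the heaviest weight down, which is how the long word list above should be read.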

In [8]:
print(doctopic.shape)

(1867, 6)


In [9]:
doctopic = doctopic / np.sum(doctopic, axis=1, keepdims=True)

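Dividing each row by its sum turns the topic weights of a document into shares that add up to 1. If a document has zero weight on every topic, the plain division produces NaN values and NumPy emits a runtime warning. The sketch below shows one possible guard on hypothetical data; leaving all-zero rows at zero is an assumption made here, not a choice taken from the original notebook.

```python
import numpy as np

# Hypothetical document-topic weights; the last document has no weight on any topic
doctopic_demo = np.array([
    [2.0, 0.0, 6.0],
    [1.0, 1.0, 2.0],
    [0.0, 0.0, 0.0],
])

row_sums = doctopic_demo.sum(axis=1, keepdims=True)
# Divide only where the row sum is positive; all-zero rows stay at zero
shares = np.divide(doctopic_demo, row_sums,
                   out=np.zeros_like(doctopic_demo), where=row_sums > 0)
print(shares)
```

`keepdims=True` keeps the row sums as a column vector, so the division broadcasts across the topics of each document.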


In [10]:
page_titles = np.asarray(list(comicsDf.name))

doctopic_orig = doctopic.copy()
num_groups = len(set(page_titles))
doctopic_grouped = np.zeros((num_groups, num_topics))


for i, name in enumerate(sorted(set(page_titles))):
    doctopic_grouped[i, :] = np.mean(doctopic[page_titles == name, :], axis=0)
    
doctopic = doctopic_grouped

print(doctopic)

[[ 0.31599468  0.          0.05157035  0.          0.12851462  0.50392035]
 [ 0.13392905  0.          0.02166091  0.02840468  0.16993469  0.64607067]
 [ 0.          0.38255308  0.07975503  0.0456573   0.          0.49203458]
 ..., 
 [ 0.01535451  0.02111365  0.00742057  0.00203582  0.00461344  0.94946201]
 [ 0.0159539   0.00909364  0.1671601   0.          0.01801538  0.78977698]
 [ 0.02020925  0.03463927  0.02296386  0.04837711  0.50871322  0.36509729]]
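The grouping step averages the topic shares of all rows that carry the same title. A minimal sketch on hypothetical data (the `demo_` names and the values are illustrative assumptions):

```python
import numpy as np

# Hypothetical titles (one per row) and normalized document-topic rows
demo_titles = np.array(["Batman", "Robin", "Batman"])
demo_rows = np.array([
    [0.8, 0.2],
    [0.5, 0.5],
    [0.6, 0.4],
])

demo_unique = sorted(set(demo_titles))
demo_grouped = np.zeros((len(demo_unique), demo_rows.shape[1]))
for i, title in enumerate(demo_unique):
    # The boolean mask selects every row that shares this title
    demo_grouped[i, :] = demo_rows[demo_titles == title].mean(axis=0)

print(demo_unique)
print(demo_grouped)
```

The rows of the grouped matrix follow the sorted unique titles, not the original row order, so any labels attached afterwards need to use that same sorted list.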


In [11]:
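# After grouping, the rows of doctopic follow sorted(set(page_titles)), so label them in that order
nmf = pd.DataFrame(data=doctopic, index=sorted(set(page_titles)),
                   columns=["T1", "T2", "T3", "T4", "T5", "T6"])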

In [12]:
display(nmf[250:300])

| | T1 | T2 | T3 | T4 | T5 | T6 |
|---|---|---|---|---|---|---|
| Captain Boomerang | 0.227167 | 0.036766 | 0.062196 | 0.092536 | 0.058436 | 0.522898 |
| Captain Britain | 0.0 | 0.198381 | 0.100846 | 0.024255 | 0.0 | 0.676518 |
| Captain Carrot | 0.056014 | 0.0 | 0.0 | 0.067592 | 0.156617 | 0.719777 |
| Captain Comet | 0.058695 | 0.044396 | 0.0 | 0.136902 | 0.221042 | 0.538964 |
| Captain Dynamo (comics) | 0.055814 | 0.12448 | 0.05251 | 0.008488 | 0.147229 | 0.611479 |
| Captain Marvel (DC Comics) | 0.009909 | 0.0 | 0.0 | 0.003247 | 0.367814 | 0.61903 |
| Captain Marvel (Khn'nr) | 0.0 | 0.030887 | 0.0 | 0.0 | 0.028714 | 0.940399 |
| Captain Marvel (Marvel Comics) | 0.0 | 0.0 | 0.015743 | 0.0 | 0.094394 | 0.889863 |
| Captain Marvel Jr. | 0.0 | 0.0 | 0.0 | 0.014005 | 0.162557 | 0.823438 |
| Captain Midlands | 0.012794 | 0.047598 | 0.023887 | 0.0 | 0.015295 | 0.900427 |


## Latent Dirichlet allocation

In [37]:
import re

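# Replace everything except word characters and asterisks with a space
texto = re.sub(r"[^\w*]", " ", "Este es un ejemplo $123 @xy45%")
# Replace every digit with a space
texto = re.sub(r"\d", " ", texto)
print(texto)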

Este es un ejemplo       xy   


In [None]:
# Count token frequencies over the whole corpus
new_main_words = []
for row in comicsDf.new_description:
    new_main_words.extend(row.split())
common_new_main_words = nltk.FreqDist(new_main_words)
# Hapaxes are tokens that occur exactly once in the corpus
uselessWords = common_new_main_words.hapaxes()
print("Number of useless words: ", len(uselessWords))
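A hapax is a token that occurs exactly once in the corpus. The same idea can be sketched without NLTK using `collections.Counter`; the toy descriptions and variable names below are hypothetical.

```python
from collections import Counter

# Hypothetical tokenized descriptions
demo_descriptions = [
    "batman robin gotham",
    "batman joker gotham",
    "superman krypton",
]

# Frequency of every token across the whole corpus
demo_freq = Counter(token for text in demo_descriptions for token in text.split())
# Keep the tokens seen exactly once
demo_hapaxes = sorted(token for token, count in demo_freq.items() if count == 1)
print(demo_hapaxes)
```

Tokens that appear in a single place cannot co-occur across documents, which is the usual argument for treating them as uninformative in topic modeling.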

In [None]:
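# Assumption: the goal is to drop the hapaxes from every description.
# The original loop referenced undefined names (n, s) and could not run.
useless_set = set(uselessWords)
comicsDf["new_description"] = comicsDf["new_description"].apply(
    lambda row: " ".join(w for w in row.split() if w not in useless_set))


# Vocabulary size counted before the hapaxes were removed
print("Number of distinct tokens in the corpus: ", len(common_new_main_words))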

In [None]:
print(uselessWords[1100:1200])