![agents](images/header.jpg)
# Lexical Analysis: Collocations and $N$-Grams
### Ramón Soto C. [(rsotoc@moviquest.com)](mailto:rsotoc@moviquest.com/)
[view on nbviewer](http://nbviewer.ipython.org/github/rsotoc/nlp/blob/master/Introducción.ipynb)



## N-Gram Model

A characteristic of natural languages is that the phrases that make them up are not uniformly distributed: some constructions are more common than others. Thus, although the phrases $f_1 = \textrm{"the dog walks"}$ and $f_2 = \textrm{"the dog flies"}$ are both syntactically correct, $f_1$ is more likely than $f_2$ to appear in an arbitrary text.

In the film "*Take The Money And Run*", Virgil Starkwell (Woody Allen) tries to rob a bank and hands the teller a note with the message "*Please put fifty thousand dollars into this bag and act natural as I am pointing a gun at you*", which the bank employees read as "*Please put fifty thousand dollars into this bag and ABT natural as I am pointing a GUB at you*".

[![](images/i_have_a_gub.jpg)](https://www.youtube.com/watch?v=pEm0zi8QrpA)

However, the phrase "*I am pointing a GUN at you*" is clearly more probable than the phrase "*I am pointing a GUB at you*", so in real life we have no trouble recognizing the correct one. To model this ability to predict the occurrence of a word in a phrase we use **language models**, which assign probabilities to the sequences of words that can make up a text.

The simplest model is the **N-gram model**. It assumes that the probability of a word occurring is determined by the most recent words, which is known as the **Markov assumption**. To estimate these probabilities it is therefore enough to count the occurrences of word sequences of a fixed length. An **$N$-gram** is a sequence of $N$ words. For example, a 2-gram (or bigram) is a sequence of 2 words, such as "*the man*", "*man walks*", "*walks in*", "*in the*", "*the park*". A 3-gram (trigram) is a sequence of three words, such as "*the man walks*", "*man walks in*", "*walks in the*", "*in the park*".
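In symbols, the Markov assumption and the count-based estimate can be written as follows. The notation is an assumption introduced here for illustration: $w_k$ is the $k$-th word of a sequence and $C(\cdot)$ is the number of times a word sequence occurs in the corpus.

$$P(w_n \mid w_1, \ldots, w_{n-1}) \approx P(w_n \mid w_{n-N+1}, \ldots, w_{n-1})$$

$$P(w_n \mid w_{n-N+1}, \ldots, w_{n-1}) \approx \frac{C(w_{n-N+1} \ldots w_{n-1}\, w_n)}{C(w_{n-N+1} \ldots w_{n-1})}$$

For bigrams ($N = 2$) this reduces to $P(w_n \mid w_{n-1}) \approx C(w_{n-1}\, w_n) / C(w_{n-1})$.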

In [1]:
import nltk
from nltk import word_tokenize
from nltk.util import ngrams
from collections import Counter

text = "los amigos toman cafe. el perro duerme en el parque. el hombre con el perro \
camina en el parque con amigos."

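# Tokenize the text (word_tokenize requires NLTK's Punkt tokenizer data)
token = word_tokenize(text)
# ngrams() returns lazy generators, so each one can be consumed only once
bigrams = ngrams(token, 2)
trigrams = ngrams(token, 3)
tetragrams = ngrams(token, 4)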

bigrams_list = list(bigrams)
counter_bigrams = Counter(bigrams_list)
print ("Bigramas:\n", counter_bigrams)

trigrams_list = list(trigrams)
counter_trigrams = Counter(trigrams_list)
print ("\nTrigramas:\n", counter_trigrams)

tetragrams_list = list(tetragrams)
counter_tetragrams = Counter(tetragrams_list)
print ("\nTetragramas:\n", counter_tetragrams)

Bigramas:
 Counter({('.', 'el'): 2, ('el', 'perro'): 2, ('en', 'el'): 2, ('el', 'parque'): 2, ('los', 'amigos'): 1, ('amigos', 'toman'): 1, ('toman', 'cafe'): 1, ('cafe', '.'): 1, ('perro', 'duerme'): 1, ('duerme', 'en'): 1, ('parque', '.'): 1, ('el', 'hombre'): 1, ('hombre', 'con'): 1, ('con', 'el'): 1, ('perro', 'camina'): 1, ('camina', 'en'): 1, ('parque', 'con'): 1, ('con', 'amigos'): 1, ('amigos', '.'): 1})

Trigramas:
 Counter({('en', 'el', 'parque'): 2, ('los', 'amigos', 'toman'): 1, ('amigos', 'toman', 'cafe'): 1, ('toman', 'cafe', '.'): 1, ('cafe', '.', 'el'): 1, ('.', 'el', 'perro'): 1, ('el', 'perro', 'duerme'): 1, ('perro', 'duerme', 'en'): 1, ('duerme', 'en', 'el'): 1, ('el', 'parque', '.'): 1, ('parque', '.', 'el'): 1, ('.', 'el', 'hombre'): 1, ('el', 'hombre', 'con'): 1, ('hombre', 'con', 'el'): 1, ('con', 'el', 'perro'): 1, ('el', 'perro', 'camina'): 1, ('perro', 'camina', 'en'): 1, ('camina', 'en', 'el'): 1, ('el', 'parque', 'con'): 1, ('parque', 'con', 'amigos'): 1, (
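Counts like the ones above become probabilities once each bigram count is divided by the number of times its first word appears as a context. The following is a minimal sketch of that estimate for the bigram case. The function name `bigram_probabilities` and the whitespace tokenization are assumptions made for illustration and are not part of the notebook.

```python
from collections import Counter

def bigram_probabilities(tokens):
    """Estimate P(word | previous word) from raw counts."""
    # Every token except the last one acts as a context for the next token
    context_counts = Counter(tokens[:-1])
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return {
        (prev, word): count / context_counts[prev]
        for (prev, word), count in bigram_counts.items()
    }

# Toy corpus, split on whitespace (an assumption; the notebook uses word_tokenize)
tokens = ("el perro duerme en el parque . "
          "el hombre con el perro camina en el parque").split()
probs = bigram_probabilities(tokens)

# 'el' occurs 5 times as a context and is followed by 'perro' twice
print(probs[("el", "perro")])
```

With no smoothing, any bigram that never occurs in the corpus gets probability zero, so this sketch is only suitable for illustrating the idea.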

### $N$-Grams in the comic book character database

In [2]:
from IPython.display import display
import pandas as pd
pd.options.display.max_colwidth = 150 

In [3]:
import json

file = 'Data Sets/Comics/clean_comics.json'
with open(file) as comics_file:
    dict_comics = json.load(comics_file)

comicsDf = pd.DataFrame.from_dict(dict_comics)

display(comicsDf.head(1))

Unnamed: 0,description,main_words,name
0,mazing man is the title character of a comic book series created by bob rozakis and stephen destefano and published by dc comics the series ran fo...,"[man, title, character, comic, book, series, created, bob, rozakis, stephen, published, dc, comics, series, ran, twelve, issues, additional, speci...",'Mazing Man


In [4]:
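# Just to soothe my OCD: put the columns in a fixed order
# (reindex_axis was removed from pandas; reindex(columns=...) is the replacement)
comicsDf = comicsDf.reindex(columns=['name', 'description', 'main_words'])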
display(comicsDf.head(1))

Unnamed: 0,name,description,main_words
0,'Mazing Man,mazing man is the title character of a comic book series created by bob rozakis and stephen destefano and published by dc comics the series ran fo...,"[man, title, character, comic, book, series, created, bob, rozakis, stephen, published, dc, comics, series, ran, twelve, issues, additional, speci..."


In [5]:
comicsDf["bigrams"] = list(map(lambda row: list(ngrams(word_tokenize(row),2)), 
                                   comicsDf.description))
display(comicsDf.head())

comics_bigrams = []
for row in comicsDf.bigrams:
    comics_bigrams.extend(row)
most_common_comics_bigrams = nltk.FreqDist(comics_bigrams)

print("Cantidad de bigramas en el corpus: ", most_common_comics_bigrams.N())
print("\nBigramas más populares:\n", most_common_comics_bigrams.most_common(50))



Unnamed: 0,name,description,main_words,bigrams
0,'Mazing Man,mazing man is the title character of a comic book series created by bob rozakis and stephen destefano and published by dc comics the series ran fo...,"[man, title, character, comic, book, series, created, bob, rozakis, stephen, published, dc, comics, series, ran, twelve, issues, additional, speci...","[(mazing, man), (man, is), (is, the), (the, title), (title, character), (character, of), (of, a), (a, comic), (comic, book), (book, series), (seri..."
1,711 (Quality Comics),is a fictional superhero from the golden age of comics he was created by george brenner and published by quality comics first appeared in police c...,"[fictional, superhero, golden, age, comics, created, george, published, quality, comics, first, appeared, police, comics, august, lasted, january,...","[(is, a), (a, fictional), (fictional, superhero), (superhero, from), (from, the), (the, golden), (golden, age), (age, of), (of, comics), (comics, ..."
2,Abigail Brand,special agent special agent abigail brand is a fictional character appearing in american comic book s published by marvel comics abigail brand s f...,"[special, agent, special, agent, abigail, brand, fictional, character, appearing, american, comic, book, published, marvel, comics, abigail, brand...","[(special, agent), (agent, special), (special, agent), (agent, abigail), (abigail, brand), (brand, is), (is, a), (a, fictional), (fictional, chara..."
3,Abin Sur,abin sur is a fictional character and a superhero from the dc comics dc universe universe he was a member of the green lantern corps and is best k...,"[abin, sur, fictional, character, superhero, dc, comics, dc, universe, universe, member, green, lantern, corps, best, known, predecessor, green, l...","[(abin, sur), (sur, is), (is, a), (a, fictional), (fictional, character), (character, and), (and, a), (a, superhero), (superhero, from), (from, th..."
4,Abner Jenkins,abner ronald jenkins formerly known as the beetle comics beetle mach mach mach mach iv mach v mach vii and currently known as mach x and is a fict...,"[abner, ronald, jenkins, formerly, known, beetle, comics, beetle, mach, mach, mach, mach, mach, mach, vii, currently, known, mach, x, fictional, c...","[(abner, ronald), (ronald, jenkins), (jenkins, formerly), (formerly, known), (known, as), (as, the), (the, beetle), (beetle, comics), (comics, bee..."


Cantidad de bigramas en el corpus:  3391034

Bigramas más populares:
 [(('of', 'the'), 26955), (('in', 'the'), 22292), (('x', 'men'), 8457), (('to', 'the'), 8214), (('and', 'the'), 7985), (('as', 'a'), 7543), (('with', 'the'), 6988), (('marvel', 'comics'), 6206), (('by', 'the'), 5661), (('he', 'is'), 5238), (('in', 'a'), 5209), (('spider', 'man'), 4942), (('to', 'be'), 4916), (('is', 'a'), 4811), (('from', 'the'), 4394), (('on', 'the'), 4182), (('as', 'the'), 4138), (('of', 'his'), 4031), (('dc', 'comics'), 3906), (('he', 'was'), 3862), (('at', 'the'), 3711), (('one', 'of'), 3458), (('during', 'the'), 3429), (('for', 'the'), 3412), (('the', 'character'), 3361), (('the', 'new'), 3318), (('comic', 'book'), 3280), (('that', 'he'), 3252), (('of', 'a'), 3147), (('the', 'x'), 3097), (('she', 'is'), 3048), (('justice', 'league'), 2992), (('the', 'team'), 2818), (('appeared', 'in'), 2806), (('member', 'of'), 2753), (('appears', 'in'), 2670), (('him', 'to'), 2460), (('version', 'of'), 2456), ((

## Collocations

In lexicology, a collocation is a combination of lexical units that stands out because it occurs with unexpectedly high frequency. Among the bigrams above, the following stand out:

<code>[... (('x', 'men'), 8457), ... (('marvel', 'comics'), 6206), ... (('spider', 'man'), 4942), ... (('dc', 'comics'), 3906), ... (('comic', 'book'), 3280), ... (('justice', 'league'), 2992), ... (('green', 'lantern'), 2288), ...]
</code>
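One common way to quantify "unexpectedly high" frequency is pointwise mutual information (PMI), which compares how often two words occur together with how often they would co-occur if they were independent. The notebook itself does not use PMI. The sketch below is an illustrative alternative, and the function name `pmi_scores`, the `min_count` threshold and the toy sentence are all assumptions.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_count=2):
    """Return PMI (in bits) for every bigram seen at least min_count times."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigram_counts.items():
        if count < min_count:
            continue  # rare bigrams give unreliable PMI estimates
        p_pair = count / n_bigrams
        p_w1 = unigram_counts[w1] / n_unigrams
        p_w2 = unigram_counts[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

# Hypothetical toy text, already lowercased and stripped of punctuation
tokens = "the x men fight the brotherhood and the x men win the day".split()
scores = pmi_scores(tokens)
for bigram, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(bigram, round(score, 3))
```

In this toy text both `('the', 'x')` and `('x', 'men')` occur twice, but `('x', 'men')` scores higher because neither of its words is frequent on its own.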

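Collocations such as <code>'x', 'men'</code> or <code>'marvel', 'comics'</code> make a smarter analysis of the text possible. To identify other important collocations, we keep only the bigrams formed by consecutive <code>main_words</code> that also occur as bigrams in the description: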

In [6]:
# Keep the bigrams of consecutive main_words that are also adjacent in the description
comicsDf["clean_bigrams"] = list(map(lambda words, bigrams: 
                                        [b for b in list(ngrams(words, 2))
                                        if b in bigrams], 
                                        comicsDf.main_words, comicsDf.bigrams))
display(comicsDf.head())

clean_bigrams = []
for row in comicsDf.clean_bigrams:
    clean_bigrams.extend(row)
common_clean_bigrams = nltk.FreqDist(clean_bigrams)

print("Cantidad de bigramas en el corpus: ", common_clean_bigrams.N())
print("\nBigramas más populares:\n", common_clean_bigrams.most_common(50))



Unnamed: 0,name,description,main_words,bigrams,clean_bigrams
0,'Mazing Man,mazing man is the title character of a comic book series created by bob rozakis and stephen destefano and published by dc comics the series ran fo...,"[man, title, character, comic, book, series, created, bob, rozakis, stephen, published, dc, comics, series, ran, twelve, issues, additional, speci...","[(mazing, man), (man, is), (is, the), (the, title), (title, character), (character, of), (of, a), (a, comic), (comic, book), (book, series), (seri...","[(title, character), (comic, book), (book, series), (series, created), (bob, rozakis), (dc, comics), (series, ran), (twelve, issues), (additional,..."
1,711 (Quality Comics),is a fictional superhero from the golden age of comics he was created by george brenner and published by quality comics first appeared in police c...,"[fictional, superhero, golden, age, comics, created, george, published, quality, comics, first, appeared, police, comics, august, lasted, january,...","[(is, a), (a, fictional), (fictional, superhero), (superhero, from), (from, the), (the, golden), (golden, age), (age, of), (of, comics), (comics, ...","[(fictional, superhero), (golden, age), (quality, comics), (comics, first), (first, appeared), (police, comics), (comics, august), (killed, daniel..."
2,Abigail Brand,special agent special agent abigail brand is a fictional character appearing in american comic book s published by marvel comics abigail brand s f...,"[special, agent, special, agent, abigail, brand, fictional, character, appearing, american, comic, book, published, marvel, comics, abigail, brand...","[(special, agent), (agent, special), (special, agent), (agent, abigail), (abigail, brand), (brand, is), (is, a), (a, fictional), (fictional, chara...","[(special, agent), (agent, special), (special, agent), (agent, abigail), (abigail, brand), (fictional, character), (character, appearing), (americ..."
3,Abin Sur,abin sur is a fictional character and a superhero from the dc comics dc universe universe he was a member of the green lantern corps and is best k...,"[abin, sur, fictional, character, superhero, dc, comics, dc, universe, universe, member, green, lantern, corps, best, known, predecessor, green, l...","[(abin, sur), (sur, is), (is, a), (a, fictional), (fictional, character), (character, and), (and, a), (a, superhero), (superhero, from), (from, th...","[(abin, sur), (fictional, character), (dc, comics), (comics, dc), (dc, universe), (universe, universe), (green, lantern), (lantern, corps), (best,..."
4,Abner Jenkins,abner ronald jenkins formerly known as the beetle comics beetle mach mach mach mach iv mach v mach vii and currently known as mach x and is a fict...,"[abner, ronald, jenkins, formerly, known, beetle, comics, beetle, mach, mach, mach, mach, mach, mach, vii, currently, known, mach, x, fictional, c...","[(abner, ronald), (ronald, jenkins), (jenkins, formerly), (formerly, known), (known, as), (as, the), (the, beetle), (beetle, comics), (comics, bee...","[(abner, ronald), (ronald, jenkins), (jenkins, formerly), (formerly, known), (beetle, comics), (comics, beetle), (beetle, mach), (mach, mach), (ma..."


Cantidad de bigramas en el corpus:  969524

Bigramas más populares:
 [(('x', 'men'), 8457), (('marvel', 'comics'), 6212), (('spider', 'man'), 4942), (('dc', 'comics'), 3909), (('comic', 'book'), 3281), (('justice', 'league'), 2992), (('green', 'lantern'), 2289), (('captain', 'america'), 2008), (('first', 'appeared'), 1462), (('iron', 'man'), 1462), (('tv', 'series'), 1437), (('fantastic', 'four'), 1410), (('uncanny', 'x'), 1326), (('teen', 'titans'), 1236), (('wonder', 'woman'), 1179), (('super', 'heroes'), 1102), (('fictional', 'character'), 993), (('captain', 'marvel'), 986), (('marvel', 'universe'), 984), (('new', 'york'), 975), (('new', 'mutants'), 884), (('civil', 'war'), 853), (('x', 'factor'), 819), (('x', 'force'), 813), (('dc', 'universe'), 799), (('lantern', 'corps'), 787), (('american', 'comic'), 781), (('ghost', 'rider'), 765), (('playable', 'character'), 753), (('amazing', 'spider'), 741), (('justice', 'society'), 726), (('green', 'arrow'), 678), (('golden', 'age'), 649), 

In these results, all the bigrams shown carry useful meaning, and some, such as <code>'x', 'men'</code>, <code>'spider', 'man'</code> and many others, lose their meaning if they are split. In these cases it can be convenient to replace the bigram with a single token: <code>'x_men'</code>, <code>'spider_man'</code>, etc.
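The replacement of a bigram by a single token can be sketched on a plain token list as follows. This is a minimal illustration, not the notebook's implementation: the function name `merge_collocations` and the example sentence are assumptions.

```python
def merge_collocations(tokens, collocations):
    """Join each adjacent pair found in `collocations` into one 'a_b' token."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in collocations:
            merged.append("_".join(pair))
            i += 2  # skip both words of the collocation
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical set of collocations chosen for the example
collocations = {("x", "men"), ("spider", "man")}
tokens = "the x men met spider man".split()
print(merge_collocations(tokens, collocations))
# ['the', 'x_men', 'met', 'spider_man']
```

Working on token pairs rather than on substrings avoids accidentally matching the beginning of a longer word.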

In [7]:
comicsDf["all_collocations"] = list(map(lambda doc, bigrams: 
                            [b[0]+" "+b[1] for b in bigrams if b[0]+" "+b[1] in doc], 
                            comicsDf.description, comicsDf.clean_bigrams))
display(comicsDf.head())

Unnamed: 0,name,description,main_words,bigrams,clean_bigrams,all_collocations
0,'Mazing Man,mazing man is the title character of a comic book series created by bob rozakis and stephen destefano and published by dc comics the series ran fo...,"[man, title, character, comic, book, series, created, bob, rozakis, stephen, published, dc, comics, series, ran, twelve, issues, additional, speci...","[(mazing, man), (man, is), (is, the), (the, title), (title, character), (character, of), (of, a), (a, comic), (comic, book), (book, series), (seri...","[(title, character), (comic, book), (book, series), (series, created), (bob, rozakis), (dc, comics), (series, ran), (twelve, issues), (additional,...","[title character, comic book, book series, series created, bob rozakis, dc comics, series ran, twelve issues, additional special, special issues, ..."
1,711 (Quality Comics),is a fictional superhero from the golden age of comics he was created by george brenner and published by quality comics first appeared in police c...,"[fictional, superhero, golden, age, comics, created, george, published, quality, comics, first, appeared, police, comics, august, lasted, january,...","[(is, a), (a, fictional), (fictional, superhero), (superhero, from), (from, the), (the, golden), (golden, age), (age, of), (of, comics), (comics, ...","[(fictional, superhero), (golden, age), (quality, comics), (comics, first), (first, appeared), (police, comics), (comics, august), (killed, daniel...","[fictional superhero, golden age, quality comics, comics first, first appeared, police comics, comics august, killed daniel, district attorney, ex..."
2,Abigail Brand,special agent special agent abigail brand is a fictional character appearing in american comic book s published by marvel comics abigail brand s f...,"[special, agent, special, agent, abigail, brand, fictional, character, appearing, american, comic, book, published, marvel, comics, abigail, brand...","[(special, agent), (agent, special), (special, agent), (agent, abigail), (abigail, brand), (brand, is), (is, a), (a, fictional), (fictional, chara...","[(special, agent), (agent, special), (special, agent), (agent, abigail), (abigail, brand), (fictional, character), (character, appearing), (americ...","[special agent, agent special, special agent, agent abigail, abigail brand, fictional character, character appearing, american comic, comic book, ..."
3,Abin Sur,abin sur is a fictional character and a superhero from the dc comics dc universe universe he was a member of the green lantern corps and is best k...,"[abin, sur, fictional, character, superhero, dc, comics, dc, universe, universe, member, green, lantern, corps, best, known, predecessor, green, l...","[(abin, sur), (sur, is), (is, a), (a, fictional), (fictional, character), (character, and), (and, a), (a, superhero), (superhero, from), (from, th...","[(abin, sur), (fictional, character), (dc, comics), (comics, dc), (dc, universe), (universe, universe), (green, lantern), (lantern, corps), (best,...","[abin sur, fictional character, dc comics, comics dc, dc universe, universe universe, green lantern, lantern corps, best known, green lantern, lan..."
4,Abner Jenkins,abner ronald jenkins formerly known as the beetle comics beetle mach mach mach mach iv mach v mach vii and currently known as mach x and is a fict...,"[abner, ronald, jenkins, formerly, known, beetle, comics, beetle, mach, mach, mach, mach, mach, mach, vii, currently, known, mach, x, fictional, c...","[(abner, ronald), (ronald, jenkins), (jenkins, formerly), (formerly, known), (known, as), (as, the), (the, beetle), (beetle, comics), (comics, bee...","[(abner, ronald), (ronald, jenkins), (jenkins, formerly), (formerly, known), (beetle, comics), (comics, beetle), (beetle, mach), (mach, mach), (ma...","[abner ronald, ronald jenkins, jenkins formerly, formerly known, beetle comics, comics beetle, beetle mach, mach mach, mach mach, mach mach, mach ..."


In [8]:
clean_collocations = []
for row in comicsDf.all_collocations:
    clean_collocations.extend(row)
common_clean_collocations = nltk.FreqDist(clean_collocations)

print("Cantidad de bigramas en el corpus: ", common_clean_collocations.N())
print("\nBigramas más populares:\n", common_clean_collocations.most_common(50))

Cantidad de bigramas en el corpus:  969524

Bigramas más populares:
 [('x men', 8457), ('marvel comics', 6212), ('spider man', 4942), ('dc comics', 3909), ('comic book', 3281), ('justice league', 2992), ('green lantern', 2289), ('captain america', 2008), ('first appeared', 1462), ('iron man', 1462), ('tv series', 1437), ('fantastic four', 1410), ('uncanny x', 1326), ('teen titans', 1236), ('wonder woman', 1179), ('super heroes', 1102), ('fictional character', 993), ('captain marvel', 986), ('marvel universe', 984), ('new york', 975), ('new mutants', 884), ('civil war', 853), ('x factor', 819), ('x force', 813), ('dc universe', 799), ('lantern corps', 787), ('american comic', 781), ('ghost rider', 765), ('playable character', 753), ('amazing spider', 741), ('justice society', 726), ('green arrow', 678), ('golden age', 649), ('world war', 635), ('united states', 616), ('alpha flight', 615), ('video game', 613), ('men vol', 603), ('black panther', 599), ('comic books', 593), ('limited ser

In [9]:
# unique collocations
nc = len(common_clean_collocations.most_common())
print(nc, "colocaciones\n", 
      common_clean_collocations.most_common(20),
      "\n...\n",
      list(common_clean_collocations.most_common())[int(nc/500):int(nc/500)+20],
      "\n...\n",
      list(common_clean_collocations.most_common())[nc-20:])

408004 colocaciones
 [('x men', 8457), ('marvel comics', 6212), ('spider man', 4942), ('dc comics', 3909), ('comic book', 3281), ('justice league', 2992), ('green lantern', 2289), ('captain america', 2008), ('first appeared', 1462), ('iron man', 1462), ('tv series', 1437), ('fantastic four', 1410), ('uncanny x', 1326), ('teen titans', 1236), ('wonder woman', 1179), ('super heroes', 1102), ('fictional character', 993), ('captain marvel', 986), ('marvel universe', 984), ('new york', 975)] 
...
 [('mansion xavier', 75), ('timber wolf', 75), ('nightwing vol', 75), ('took place', 74), ('vol green', 74), ('sinister six', 74), ('titans east', 74), ('outsiders comics', 74), ('thor vol', 74), ('supergirl kara', 74), ('black hood', 74), ('blue marvel', 74), ('marvel boy', 74), ('books golden', 73), ('long enough', 73), ('three issue', 73), ('infinity gauntlet', 73), ('come comics', 73), ('new teen', 73), ('team member', 73)] 
...
 [('however zor', 1), ('tied together', 1), ('suggested zor', 1), 

From this listing, which shows the 20 most frequent collocations, 20 collocations of intermediate frequency and the 20 least frequent ones, we see that not all of them can be taken as tokens; in fact, most of them make no sense as units. We therefore build the collocations taking into account the preliminary dictionary, reflected in the <code>main_words</code> column, and keep the character's name:

In [12]:
import math

# Document frequency: number of descriptions that contain each collocation
comics_all_collocations = list(common_clean_collocations.keys())
dict_all_collocations = dict(zip(comics_all_collocations, [0]*nc))
for w in comics_all_collocations:
    for d in comicsDf.description:
        if(w in d):
            dict_all_collocations[w] += 1

# Keep the collocations whose TF-IDF exceeds 0.01 in at least one description
top_collocations = []
numDocs_comics = len(comicsDf)
for d in comicsDf.description:
    N = len(d.split()) - 1  # the number of bigrams is the number of words - 1
    for w in reversed(comics_all_collocations):  # in reverse, so that removing w does not disturb the iteration
        if(w in d):
            tfidf = d.count(w) / N * math.log(numDocs_comics/dict_all_collocations[w], 2)
            if(tfidf > 0.01):
                top_collocations.append(w)
                comics_all_collocations.remove(w)

80691 
 [('space adventurer', 1), ('whose appearance', 1), ('man script', 1), ('special issue', 1), ('though highly', 1), ('one miniseries', 1), ('archives lost', 1), ('costume list', 1), ('named owen', 1), ('couple named', 1), ('accident prone', 1), ('catastrophe segment', 1), ('kitty catastrophe', 1), ('spectacular voiced', 1), ('star spectacular', 1), ('four star', 1), ('episode four', 1), ('stuffed shirt', 1), ('mother walter', 1), ('whose appearances', 1), ('baby whose', 1), ('usually grunts', 1), ('building never', 1), ('front steps', 1), ('heroics mrs', 1), ('women sgt', 1), ('pursuing women', 1), ('drinking beer', 1), ('time wondering', 1), ('sister guido', 1), ('looking half', 1), ('human looking', 1), ('watson denton', 1), ('quincy high', 1), ('john quincy', 1), ('glove winner', 1), ('gold glove', 1), ('brenda baseball', 1), ('crook married', 1), ('bank job', 1), ('bank promoted', 1), ('richmond bank', 1), ('south richmond', 1), ('assistant manager', 1), ('valentine assistant
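The score used in the cell above is a TF-IDF weight: the share of a description's bigrams taken up by the collocation, multiplied by the base-2 logarithm of the number of documents divided by the number of documents that contain it. The following self-contained sketch reproduces that formula on a toy corpus. The function name `collocation_tfidf` and the four sample descriptions are assumptions for illustration; like the notebook, it uses substring matching, which can over-count when a collocation is a prefix of a longer phrase.

```python
import math

def collocation_tfidf(term, doc, docs):
    """TF-IDF of `term` in `doc`, with substring matching as in the notebook."""
    n_bigrams = max(len(doc.split()) - 1, 1)   # guard against one-word documents
    tf = doc.count(term) / n_bigrams
    df = sum(1 for d in docs if term in d)     # document frequency
    if df == 0:
        return 0.0
    return tf * math.log(len(docs) / df, 2)

# Hypothetical mini-corpus of cleaned descriptions
docs = [
    "green lantern joins the green lantern corps",
    "the team meets in new york",
    "the hero returns to new york",
    "the villain escapes",
]

for term, doc in [("green lantern", docs[0]), ("new york", docs[1]), ("the", docs[3])]:
    print(term, "->", round(collocation_tfidf(term, doc, docs), 4))
```

A term that appears in every document, such as `the` here, gets a weight of zero, which is why the threshold of 0.01 discards collocations that are spread evenly over the corpus.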

In [23]:
import re

# Collocations that passed the TF-IDF filter, ordered from most to least frequent
collocations_tokens = [b for b in common_clean_collocations.most_common() 
                      if b[0] in top_collocations]
comicsDf = comicsDf.reindex(columns = ["name", "description", "main_words", "bigrams", 
                                        "clean_bigrams", "all_collocations", "new_description"])

for i, row in enumerate(comicsDf.main_words):
    # Character name as a single token, e.g. "Abin Sur" -> "abin_sur "
    n = "_".join(comicsDf.loc[i, "name"].split()) + " "
    s = " ".join(row)
    for w in collocations_tokens:
        # Replace "word1 word2" with "word1_word2"; escape the pattern so it is matched literally
        s = re.sub(" " + re.escape(w[0]), " " + "_".join(w[0].split()), s)
    comicsDf.loc[i, "new_description"] = n.lower() + s
display(comicsDf.head())

# Frequency distribution over the new vocabulary (single words and merged collocations)
new_main_words = []
for row in comicsDf.new_description:
    new_main_words.extend(row.split())
common_new_main_words = nltk.FreqDist(new_main_words)

In [65]:
print("Number of distinct tokens in the corpus: ", len(common_new_main_words.most_common()))

lexicon_comics = [w[0] for w in common_new_main_words.most_common(6000)]
print("\nMost frequent tokens:\n{}\n...\n{}"
      .format(lexicon_comics[:50], lexicon_comics[5950:6000]))

Number of distinct tokens in the corpus:  93068

Most frequent tokens:
['x_men', 'one', 'time', 'marvel_comics', 'earth', 'comics', 'powers', 'spider_man', 'also', 'series', 'character', 'team', 'appears', 'batman', 'justice_league', 'member', 'however', 'battle', 'dc_comics', 'death', 'superman', 'later', 'created', 'new', 'father', 'version', 'body', 'avengers', 'able', 'help', 'power', 'revealed', 'vol', 'green_lantern', 'life', 'part', 'killed', 'voiced', 'fight', 'ability', 'two', 'world', 'use', 'would', 'first', 'back', 'used', 'well', 'order', 'marvel']
...
['inter', 'carl', 'distinct', 'joined_forces', 'archenemy', 'alley', 'tension', 'swamp', 'religion', 'numbering', 'pit', 'enhance', 'antidote', 'promising', 'gail_simone', 'clashed', 'jpg_left_thumb', 'texas', 'tremendous', 'credits', 'engineered', 'likes', 'cosmic_teams', 'gave_birth', 'concussive_force', 'liquid', 'intervene', 'combine', 'labs', 'long_term', 'crime_fighter', 'african', 'bride', 'faked', 'heroic_age', 'wins', '

## Conclusions

<hr style="border-width: 3px;">

### Assignment 1

Describe a pattern recognition problem that interests you and explain why a traditional model would be inappropriate for solving it (use the next cell in this notebook to present your chosen problem).

**Due date**: Friday, January 20.