# Final Project - Information Retrieval
## Fall 2023
## Benemérita Universidad Autónoma de Puebla
### Alfonso Reyes D'Elia
---
### __Problem statement:__
1. Choose one of the corpora located in the Proyecto folder on Teams. It contains documents for training and testing; otherwise, take 80 percent of the data for training and 20 percent for testing.
2. Preprocess the documents to obtain their vector representation.
3. Use Weka or Python to classify the texts; experiment with tf weighting, with tf-idf weighting, and with a reduced vocabulary.
4. Write your report in the same format as the lab assignments, reporting the accuracy, precision, recall, and F-measure obtained.
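
The four measures requested in point 4 can be computed from the counts of true and false positives and negatives. The following is a minimal sketch, not part of the original notebook: the function name `classification_metrics` and the label lists are hypothetical and exist only for illustration.

```python
def classification_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall and F1, treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical gold labels and predictions, made up for illustration
y_true = ["M", "M", "F", "F", "M", "F"]
y_pred = ["M", "F", "F", "M", "M", "F"]
metrics = classification_metrics(y_true, y_pred, positive="F")
print(metrics)
```

Because the corpus has two classes (`M` and `F`), precision, recall and F1 depend on which class is treated as positive, so it is worth stating that choice in the report.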

---
### Preprocessing
* We extract the data from the dataset: blog-gender-dataset.xlsx

In [3]:
import pandas as pd
blog_df = pd.read_excel("assets/blog-gender-dataset.xlsx", sheet_name="training" ,
                        header=None, usecols=[0,1], names=["text","gender"])


In [11]:
display(blog_df)

Unnamed: 0,text,gender
0,Long time no see. Like always I was rewriting...,M
1,Guest Demo: Eric Iverson’s Itty Bitty Search\...,M
2,Who moved my Cheese??? The world has been de...,M
3,Yesterday I attended a biweekly meeting of an...,M
4,Liam is nothing like Natalie. Natalie never w...,F
...,...,...
3227,It was a scavenger style race with checkpoints...,M
3228,Finally! I got a full day's work done. Almost ...,F
3229,"At the height of laughter, the universe is flu...",M
3230,"I like birds, especially woodpeckers and MOST ...",M


* Before any further processing, the corpus has to be modeled according to the vector model.
First we need the vocabulary, so we extract all the text from the _text_ column.


In [5]:
# Concatenate every document into one string (no separator is added between documents)
all_text = "".join(str(text) for text in blog_df['text'])
print(all_text[:100])


 Long time no see. Like always I was rewriting it from scratch a couple of times. But nevertheless i


In [8]:
# this function tokenizes the text
import string, re

re_punc = re.compile('[%s]' % re.escape(string.punctuation))
def tokenize_text(text: str, re_punc):
    text_lower = text.lower()
    words = re.split(r'\W+',text_lower)
    return [re_punc.sub("",w) for w in words]

In [7]:
words = tokenize_text(all_text, re_punc)


In [8]:
words[:10]

['', 'long', 'time', 'no', 'see', 'like', 'always', 'i', 'was', 'rewriting']
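
The empty string at the start of the output comes from `re.split`: when the text begins or ends with a non-word character, the split produces an empty field at that edge. The sketch below reproduces the notebook's tokenizer on a hypothetical sample sentence to show this, and also shows that underscores (which `\W` does not match) are removed by `re_punc`.

```python
import re
import string

re_punc = re.compile('[%s]' % re.escape(string.punctuation))

def tokenize_text(text: str, re_punc):
    # Same logic as the notebook's tokenizer: lowercase, split on runs of
    # non-word characters, then strip any remaining punctuation
    text_lower = text.lower()
    words = re.split(r'\W+', text_lower)
    return [re_punc.sub("", w) for w in words]

sample = " Long time no see. I'm back!"  # hypothetical sample text
tokens = tokenize_text(sample, re_punc)
print(tokens)

# Dropping the empty edge tokens
non_empty = [t for t in tokens if t]
print(non_empty)

# Underscore is a word character for \W, so only re_punc removes it
print(tokenize_text("snake_case", re_punc))
```

Note that the apostrophe in `I'm` splits the word into `i` and `m`, which is why single-letter tokens such as `s` and `m` show up in later outputs.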

* We remove stopwords

In [9]:
import nltk
nltk.download('stopwords')

# The stopword list can now be imported
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # a set makes membership tests fast

# Drop the stopwords with a list comprehension
words_clean = [x for x in words if x not in stop_words]
print(words_clean[:100])

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\poncho\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['', 'long', 'time', 'see', 'like', 'always', 'rewriting', 'scratch', 'couple', 'times', 'nevertheless', 'still', 'java', 'uses', 'metropolis', 'sampling', 'help', 'poor', 'path', 'tracing', 'converge', 'btw', 'mlt', 'yesterday', 'evening', '2', 'beers', 'ballmer', 'peak', 'altough', 'implementation', 'still', 'fresh', 'easily', 'outperforms', 'standard', 'path', 'tracing', 'seen', 'especially', 'difficult', 'caustics', 'involved', 'implemented', 'spectral', 'rendering', 'easy', 'actually', 'cause', 'computations', 'wavelengths', 'linear', 'like', 'rgb', 'realised', 'even', 'feel', 'physically', 'correct', 'whats', 'point', '3d', 'applications', 'operating', 'rgb', 'color', 'space', 'cant', 'represent', 'rgb', 'color', 'spectrum', 'interchangeably', 'approximate', 'long', 'running', 'physical', 'simulation', 'something', 'see', 'benefits', 'please', 'correct', 'wrong', 'thus', 'abandoned', 'guest', 'demo', 'eric', 'iverson', 'itty', 'bitty', 'search', 'february', '16th', '2010', 'danie

* Now we remove digit-only tokens and empty strings _""_

In [10]:
words_filtered = [x for x in words_clean if not x.isdigit() and x]
print(words_filtered[:100])
print(len(words_filtered))

['long', 'time', 'see', 'like', 'always', 'rewriting', 'scratch', 'couple', 'times', 'nevertheless', 'still', 'java', 'uses', 'metropolis', 'sampling', 'help', 'poor', 'path', 'tracing', 'converge', 'btw', 'mlt', 'yesterday', 'evening', 'beers', 'ballmer', 'peak', 'altough', 'implementation', 'still', 'fresh', 'easily', 'outperforms', 'standard', 'path', 'tracing', 'seen', 'especially', 'difficult', 'caustics', 'involved', 'implemented', 'spectral', 'rendering', 'easy', 'actually', 'cause', 'computations', 'wavelengths', 'linear', 'like', 'rgb', 'realised', 'even', 'feel', 'physically', 'correct', 'whats', 'point', '3d', 'applications', 'operating', 'rgb', 'color', 'space', 'cant', 'represent', 'rgb', 'color', 'spectrum', 'interchangeably', 'approximate', 'long', 'running', 'physical', 'simulation', 'something', 'see', 'benefits', 'please', 'correct', 'wrong', 'thus', 'abandoned', 'guest', 'demo', 'eric', 'iverson', 'itty', 'bitty', 'search', 'february', '16th', 'daniel', 'tunkelang', 

The corpus originally consists of 700726 tokens. Before determining the vocabulary (by counting the words), we first apply stemming (with the Porter stemmer) to the whole set.

In [11]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
palabras_truncadas = [stemmer.stem(word) for word in words_filtered]
print(palabras_truncadas[:100])

['long', 'time', 'see', 'like', 'alway', 'rewrit', 'scratch', 'coupl', 'time', 'nevertheless', 'still', 'java', 'use', 'metropoli', 'sampl', 'help', 'poor', 'path', 'trace', 'converg', 'btw', 'mlt', 'yesterday', 'even', 'beer', 'ballmer', 'peak', 'altough', 'implement', 'still', 'fresh', 'easili', 'outperform', 'standard', 'path', 'trace', 'seen', 'especi', 'difficult', 'caustic', 'involv', 'implement', 'spectral', 'render', 'easi', 'actual', 'caus', 'comput', 'wavelength', 'linear', 'like', 'rgb', 'realis', 'even', 'feel', 'physic', 'correct', 'what', 'point', '3d', 'applic', 'oper', 'rgb', 'color', 'space', 'cant', 'repres', 'rgb', 'color', 'spectrum', 'interchang', 'approxim', 'long', 'run', 'physic', 'simul', 'someth', 'see', 'benefit', 'pleas', 'correct', 'wrong', 'thu', 'abandon', 'guest', 'demo', 'eric', 'iverson', 'itti', 'bitti', 'search', 'februari', '16th', 'daniel', 'tunkelang', 'respond', 'back', 'vacat', 'still', 'dig']


In [19]:
# counting words with Counter
from collections import Counter

count_sin_red = Counter(palabras_truncadas)
labels, values = zip(*count_sin_red.items())
print(labels[:10])
print(values[:10])
print(len(count_sin_red))

('long', 'time', 'see', 'like', 'alway', 'rewrit', 'scratch', 'coupl', 'nevertheless', 'still')
(1079, 4182, 2189, 4510, 1075, 8, 48, 507, 27, 1391)
36124


This gives a vocabulary of 36124 words. Next we check whether it can be reduced.

In [20]:
# look at the term-frequency distribution in 20-quantiles (5% steps)
import statistics
quantile_cuts = statistics.quantiles(count_sin_red.values(), n=20)
print(quantile_cuts)

[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 5.0, 6.0, 9.0, 14.0, 25.0, 69.0]


The quantile analysis (n=20) shows that most words occur only once or twice in the corpus, so removing just the words with a single occurrence would drop roughly 40% of the vocabulary, which is too much.

In [16]:
len({k: v for k, v in count_sin_red.items() if v == 1}) / len(count_sin_red)

0.44081497065662717

More precisely, 44.08%.
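
Words that occur exactly once make up a large share of the vocabulary but a much smaller share of the running text. The sketch below illustrates the difference on a hypothetical token list; the tokens and the variable names `vocab_share` and `token_share` are assumptions for illustration, not values from the corpus.

```python
from collections import Counter

# Hypothetical stemmed tokens, made up for illustration
tokens = ["path", "trace", "path", "render", "rgb",
          "rgb", "rgb", "beer", "peak", "trace"]
counts = Counter(tokens)

# Terms that occur exactly once
hapax = [term for term, c in counts.items() if c == 1]

# Share of the vocabulary (distinct terms) that occurs once
vocab_share = len(hapax) / len(counts)
# Share of all token occurrences accounted for by those terms
token_share = len(hapax) / sum(counts.values())

print(sorted(hapax))
print(vocab_share, token_share)
```

The same two ratios computed on `count_sin_red` would show how much text is actually lost when single-occurrence terms are dropped.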

To avoid shrinking the vocabulary too much, we instead analyze the vocabulary by word length.

In [17]:
len({k: v for k, v in count_sin_red.items() if len(k) <= 4}) / len(count_sin_red)

0.20468386668143063

A quick check shows that keeping only the words longer than 4 characters (5 or more) shrinks the vocabulary by just 20.47%, which suits our application.

In [22]:
reduced_words = {k: v for k, v in count_sin_red.items() if len(k) > 4}
print("We deleted " + str(len(labels) - len(reduced_words)) + " words from the vocabulary")
print(len(reduced_words))

We deleted 7394 words from the vocabulary
28730


In [23]:
original_words = dict(count_sin_red)

We now have a reduced vocabulary and an unreduced one, and both will be used for the classification tasks. First, though, we build the vector model of both vocabularies with the tf and tf-idf schemes.

In [24]:
# first, save both vocabularies
import json
with open('resultados\\vocabularios\\voc_normal.json', 'w') as f:
    json.dump(original_words, f)  # the with block closes the file
with open('resultados\\vocabularios\\voc_reducido.json', 'w') as f:
    json.dump(reduced_words, f)

---
We load both vocabularies back:

In [1]:
import json
import re, string
with open('resultados\\vocabularios\\voc_normal.json', 'r') as file:
    original_words = json.load(file)

with open('resultados\\vocabularios\\voc_reducido.json', 'r') as file:
    reduced_words = json.load(file)

In [2]:
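# blog_df and tokenize_text come from the preprocessing cells above; after a kernel
# restart they must be re-created first, otherwise this cell raises a NameError
import pandas as pd
blog_df = pd.read_excel("assets/blog-gender-dataset.xlsx", sheet_name="training",
                        header=None, usecols=[0,1], names=["text","gender"])
re_punc = re.compile('[%s]' % re.escape(string.punctuation))
docs_sep = [tokenize_text(text=str(w), re_punc=re_punc) for w in blog_df['text']]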

In [15]:
docs_sep[1][:10]

['',
 'guest',
 'demo',
 'eric',
 'iverson',
 's',
 'itty',
 'bitty',
 'search',
 'february']

In [4]:
order_og_words = dict(sorted(original_words.items()))
order_og_words

{'0001pt': 6,
 '000rpm': 1,
 '000th': 1,
 '000vnd': 1,
 '00am': 12,
 '00i': 1,
 '00p': 4,
 '00pm': 4,
 '024x600': 1,
 '04am': 2,
 '04pm': 1,
 '0in': 23,
 '0j6': 1,
 '0mm': 1,
 '0o': 1,
 '0pt': 8,
 '0px': 14,
 '0rc1': 1,
 '0x16': 1,
 '0x1e9406': 1,
 '0x3c': 1,
 '0x44': 1,
 '0x50': 1,
 '0xa': 1,
 '0xc9': 1,
 '100': 3,
 '1000tag': 1,
 '1005pe': 3,
 '100g': 2,
 '100k': 4,
 '100kg': 1,
 '100kilo': 1,
 '100lb': 1,
 '100m': 1,
 '100mbp': 1,
 '100th': 2,
 '102cm': 1,
 '103rd': 1,
 '105degreesacademi': 1,
 '105mm': 2,
 '1080p': 1,
 '108lb': 1,
 '10am': 7,
 '10cm': 2,
 '10g': 1,
 '10k': 8,
 '10kg': 1,
 '10m': 1,
 '10million': 1,
 '10nth': 2,
 '10pm': 3,
 '10th': 22,
 '10yr': 1,
 '1100ce': 1,
 '110gm': 1,
 '112g': 1,
 '112th': 1,
 '11a': 2,
 '11am': 3,
 '11m': 1,
 '11pm': 4,
 '11st': 2,
 '11th': 19,
 '120mg': 1,
 '124c': 1,
 '124d': 1,
 '124g': 1,
 '124h': 1,
 '124th': 1,
 '128mb': 1,
 '12am': 3,
 '12cm': 1,
 '12gb': 1,
 '12k': 1,
 '12oz': 1,
 '12pm': 4,
 '12second': 2,
 '12th': 16,
 '12v': 1,
 '

In [15]:
"""contamos por documento""" 
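## full vocabulary ##
from operator import countOf

# Map each document number (starting at 1) to its non-empty tokens.
# Note: these tokens are neither stemmed nor stopword-filtered, while the
# vocabulary keys are Porter stems, so only raw tokens identical to a stem
# are counted in the cells that follow.
doc_words_per_doc = dict()
for i, w in enumerate(blog_df['text'], start=1):
    doc_words_per_doc[i] = [x for x in tokenize_text(text=str(w), re_punc=re_punc) if x]
doc_words_per_doc[2]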

['guest',
 'demo',
 'eric',
 'iverson',
 's',
 'itty',
 'bitty',
 'search',
 'february',
 '16th',
 '2010',
 'by',
 'daniel',
 'tunkelang',
 'respond',
 'i',
 'm',
 'back',
 'from',
 'vacation',
 'and',
 'still',
 'digging',
 'my',
 'way',
 'out',
 'of',
 'everything',
 'that',
 's',
 'piled',
 'up',
 'while',
 'i',
 've',
 'been',
 'offline',
 'while',
 'i',
 'catch',
 'up',
 'i',
 'thought',
 'i',
 'd',
 'share',
 'with',
 'you',
 'a',
 'demo',
 'that',
 'eric',
 'iverson',
 'was',
 'gracious',
 'enough',
 'to',
 'share',
 'with',
 'me',
 'it',
 'uses',
 'yahoo',
 'boss',
 'to',
 'support',
 'an',
 'exploratory',
 'search',
 'experience',
 'on',
 'top',
 'of',
 'a',
 'general',
 'web',
 'search',
 'engine',
 'when',
 'you',
 'perform',
 'a',
 'query',
 'the',
 'application',
 'retrieves',
 'a',
 'set',
 'of',
 'related',
 'term',
 'candidates',
 'using',
 'yahoo',
 's',
 'key',
 'terms',
 'api',
 'it',
 'then',
 'scores',
 'each',
 'term',
 'by',
 'dividing',
 'its',
 'occurrence',
 '

In [17]:
from collections import Counter

# Count each document's tokens once, then look up every vocabulary term
# (a Counter returns 0 for terms that are absent)
freq_per_doc_completo = dict()
for d, doc_tokens in doc_words_per_doc.items():
    doc_counts = Counter(doc_tokens)
    freq_per_doc_completo[d] = {t: doc_counts[t] for t in order_og_words}

In [None]:
freq_per_doc_completo[3]

{'0001pt': 0,
 '000rpm': 0,
 '000th': 0,
 '000vnd': 0,
 '00am': 0,
 '00i': 0,
 '00p': 0,
 '00pm': 0,
 '024x600': 0,
 '04am': 0,
 '04pm': 0,
 '0in': 0,
 '0j6': 0,
 '0mm': 0,
 '0o': 0,
 '0pt': 0,
 '0px': 0,
 '0rc1': 0,
 '0x16': 0,
 '0x1e9406': 0,
 '0x3c': 0,
 '0x44': 0,
 '0x50': 0,
 '0xa': 0,
 '0xc9': 0,
 '100': 0,
 '1000tag': 0,
 '1005pe': 0,
 '100g': 0,
 '100k': 0,
 '100kg': 0,
 '100kilo': 0,
 '100lb': 0,
 '100m': 0,
 '100mbp': 0,
 '100th': 0,
 '102cm': 0,
 '103rd': 0,
 '105degreesacademi': 0,
 '105mm': 0,
 '1080p': 0,
 '108lb': 0,
 '10am': 0,
 '10cm': 0,
 '10g': 0,
 '10k': 0,
 '10kg': 0,
 '10m': 0,
 '10million': 0,
 '10nth': 0,
 '10pm': 0,
 '10th': 0,
 '10yr': 0,
 '1100ce': 0,
 '110gm': 0,
 '112g': 0,
 '112th': 0,
 '11a': 0,
 '11am': 0,
 '11m': 0,
 '11pm': 0,
 '11st': 0,
 '11th': 0,
 '120mg': 0,
 '124c': 0,
 '124d': 0,
 '124g': 0,
 '124h': 0,
 '124th': 0,
 '128mb': 0,
 '12am': 0,
 '12cm': 0,
 '12gb': 0,
 '12k': 0,
 '12oz': 0,
 '12pm': 0,
 '12second': 0,
 '12th': 0,
 '12v': 0,
 '1300th

Now the same with the reduced vocabulary

In [5]:
order_red_words = dict(sorted(reduced_words.items()))

In [18]:
## reduced vocabulary ##
from collections import Counter

freq_per_doc_reducido = dict()
for d, doc_tokens in doc_words_per_doc.items():
    doc_counts = Counter(doc_tokens)  # absent terms count as 0
    freq_per_doc_reducido[d] = {t: doc_counts[t] for t in order_red_words}

In [None]:
freq_per_doc_reducido[3]

This is the model under the tf scheme. We now save it.
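
To make the structure of the tf model concrete, the following sketch builds the same kind of nested dictionary (document number, then term, then raw count) on two hypothetical documents. The helper `tf_vector`, the toy vocabulary and the toy documents are assumptions for illustration only.

```python
from collections import Counter

def tf_vector(tokens, vocabulary):
    """Raw term frequency of every vocabulary term in one document."""
    counts = Counter(tokens)
    # Counter returns 0 for terms the document does not contain
    return {term: counts[term] for term in vocabulary}

# Hypothetical vocabulary and documents, made up for illustration
vocabulary = sorted({"demo", "search", "engine", "query"})
docs = {
    1: ["guest", "demo", "search", "demo"],
    2: ["web", "search", "engine", "search", "query"],
}

tf = {doc_id: tf_vector(tokens, vocabulary) for doc_id, tokens in docs.items()}
print(tf[1])
print(tf[2])
```

Tokens outside the vocabulary (`guest`, `web`) are ignored, and every document vector has one entry per vocabulary term, so all vectors share the same dimensions.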

In [None]:
with open('resultados\\tfidf\\tf_completo.json', 'w') as f:
    json.dump(freq_per_doc_completo, f)

In [44]:
with open('resultados\\tfidf\\tf_reducido.json', 'w') as f:
    json.dump(freq_per_doc_reducido, f)

We load the frequency dictionaries back:

In [None]:

import json
with open('resultados\\tfidf\\tf_completo.json', 'r') as file:
    freq_per_doc_completo = json.load(file)

In [1]:
import json
with open('resultados\\tfidf\\tf_reducido.json', 'r') as file:
    freq_per_doc_reducido = json.load(file)
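
One detail to keep in mind when reloading these dictionaries: JSON object keys are always strings, so the integer document numbers used as keys before saving come back as `'1'`, `'2'`, and so on. The sketch below shows the round trip on a hypothetical dictionary and one possible way to restore integer keys; the variable names are assumptions for illustration.

```python
import json

# Hypothetical per-document frequencies keyed by integer document number
freq = {1: {"demo": 2}, 2: {"demo": 0}}

# Serializing and parsing again turns the integer keys into strings
restored = json.loads(json.dumps(freq))
print(list(restored.keys()))

# Optional: convert the outer keys back to integers after loading
restored_int = {int(doc_id): terms for doc_id, terms in restored.items()}
print(restored_int == freq)
```

With dictionaries reloaded from disk, a lookup such as `freq_per_doc_reducido[1]` would therefore need the key `'1'` unless the keys are converted back.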

Next, the representation of the documents under the tf-idf scheme
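
Written out, the weighting that the following cells compute appears to be the one below, where $N$ is the number of documents (`doc_count`), $N_t$ is the number of documents that contain term $t$ (`N_t`), and $\ln$ is the natural logarithm used by `math.log`. The symbols $w_{t,d}$ and $\mathrm{tf}_{t,d}$ are notation introduced here for illustration.

```latex
\mathrm{idf}_t =
\begin{cases}
\ln\dfrac{N}{N_t} + 1, & N_t > 0 \\[6pt]
0, & N_t = 0
\end{cases}
\qquad\qquad
w_{t,d} = \mathrm{tf}_{t,d} \cdot \mathrm{idf}_t
```

The added 1 keeps a term that occurs in every document from receiving a weight of zero, since $\ln(N/N_t) = 0$ when $N_t = N$.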

In [19]:
# first, compute N_t (number of documents containing each term)
from math import log

N_t = {term: 0 for term in order_og_words}
for term in N_t:
    for doc_id in freq_per_doc_completo:
        if freq_per_doc_completo[doc_id][term] != 0:
            N_t[term] += 1

In [20]:
# computing idf_i
doc_count = len(blog_df)
idf_i = dict().fromkeys(N_t.keys())
for i in idf_i.keys():
    idf_i[i] = 0
    if N_t[i]:
        idf_i[i] = log(doc_count / N_t[i]) + 1

In [14]:
idf_i['chile']

0

Building the matrices

In [21]:
#order_og_words_idf = dict().fromkeys(order_og_words.keys())
#for i in order_og_words.keys():
#    order_og_words_idf[i] = order_og_words[i] * idf_i[i]
    
freq_per_doc_idf = dict()
for i in freq_per_doc_completo.keys():
    freq_per_doc_idf[i] = dict().fromkeys(order_og_words.keys())
    for j in order_og_words.keys():
        freq_per_doc_idf[i][j] = freq_per_doc_completo[i][j] * idf_i[j]
        
# freq_per_doc_2   -> vector d

In [41]:
freq_per_doc_idf[1]['altough']

8.38770923908104

* with the reduced vocabulary

In [23]:

#order_red_words_idf = dict().fromkeys(order_red_words.keys())
#for i in order_red_words.keys():
#    order_red_words_idf[i] = order_red_words[i] * idf_i[i]
    
freq_per_doc_idf_red = dict()
for i in freq_per_doc_reducido.keys():
    freq_per_doc_idf_red[i] = dict().fromkeys(order_red_words.keys())
    for j in order_red_words.keys():
        freq_per_doc_idf_red[i][j] = freq_per_doc_reducido[i][j] * idf_i[j]
        
# dict_freq_2         -> vector q
# freq_per_doc_2   -> vector d

In [65]:
freq_per_doc_idf_red[1]['altough']

8.38770923908104

This is the tf-idf representation of the documents. We now save the files.
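
As a compact check of the whole tf-idf computation, the sketch below applies the same steps (document frequency, then idf as `log(N / N_t) + 1`, then tf times idf) to a hypothetical two-document tf model. The toy counts and the names `doc_freq`, `idf` and `tfidf` are assumptions for illustration.

```python
from math import log

# Hypothetical tf model: document number -> term -> raw count
tf = {
    1: {"demo": 2, "engine": 0, "search": 1},
    2: {"demo": 0, "engine": 1, "search": 2},
}
n_docs = len(tf)
vocabulary = sorted(next(iter(tf.values())))

# Document frequency: in how many documents each term occurs
doc_freq = {t: sum(1 for counts in tf.values() if counts[t] != 0)
            for t in vocabulary}

# idf = ln(N / N_t) + 1, or 0 for terms that occur in no document
idf = {t: (log(n_docs / doc_freq[t]) + 1 if doc_freq[t] else 0)
       for t in vocabulary}

# tf-idf weight of every term in every document
tfidf = {doc_id: {t: counts[t] * idf[t] for t in vocabulary}
         for doc_id, counts in tf.items()}
print(tfidf)
```

Here `search` occurs in both documents, so its idf is exactly 1 and its weight equals its raw count, while `demo` and `engine` each occur in one document and are scaled up by `log(2) + 1`.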

In [22]:
import json
with open('resultados\\tfidf\\idf_completo.json', 'w') as f:
    json.dump(freq_per_doc_idf, f)

In [24]:
import json
with open('resultados\\tfidf\\idf_reducido.json', 'w') as f:
    json.dump(freq_per_doc_idf_red, f)