# Workshop 06: Vector Databases

### Part 1: Retrieval with TF-IDF

**1. Load the data in Python**

In [1]:
import pandas as pd

# Load the CSV file into a DataFrame
df = pd.read_csv("../data/wiki_movie_plots_deduped.csv")

In [2]:
# Keep only the Title and Plot columns
df_filtered = df[['Title', 'Plot']]

# Show the first rows of the filtered DataFrame
print(df_filtered.head())

                              Title  \
0            Kansas Saloon Smashers   
1     Love by the Light of the Moon   
2           The Martyred Presidents   
3  Terrible Teddy, the Grizzly King   
4            Jack and the Beanstalk   

                                                Plot  
0  A bartender is working at a saloon, serving dr...  
1  The moon, painted with a smiling face hangs ov...  
2  The film, just over a minute long, is composed...  
3  Lasting just 61 seconds and consisting of two ...  
4  The earliest known adaptation of the classic f...  


**2. Configure TF-IDF**

- Use the scikit-learn library to compute the TF-IDF scores of the plots

In [4]:
import unicodedata
import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Text-cleaning function
def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Strip accents (drop combining marks after NFD normalization)
    text = ''.join(
        c for c in unicodedata.normalize('NFD', text) if unicodedata.category(c) != 'Mn'
    )
    # Remove digits, punctuation, and any other non-alphabetic characters
    text = re.sub(r'[^a-z\s]', '', text)
    return text

# Initialize the TF-IDF vectorizer with the custom preprocessor
tfidf_vectorizer = TfidfVectorizer(
    max_features=5000,  # Limit the vocabulary size
    preprocessor=clean_text,  # Apply the custom cleaning
    token_pattern=r'\b[a-z]{2,}\b'  # Keep only words of at least 2 letters
)

# Compute the TF-IDF scores for the plots
tfidf_matrix = tfidf_vectorizer.fit_transform(df_filtered['Plot'].fillna(''))

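# Convert only the first rows of the sparse matrix into a DataFrame.
# Calling .toarray() on the full matrix (every plot x 5000 features) builds a very
# large dense array; the original run of this cell ended in a KeyboardInterrupt.
n_preview = 5
tfidf_df = pd.DataFrame(
    tfidf_matrix[:n_preview].toarray(),
    columns=tfidf_vectorizer.get_feature_names_out(),
    index=df_filtered['Title'].iloc[:n_preview]
)

# Show the preview of the TF-IDF DataFrame
print(tfidf_df)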
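To make the scores in the matrix less opaque, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts multiplied by a smoothed inverse document frequency, followed by L2 normalization of each row. The helper `tfidf_vectors` and the three toy sentences are illustrative assumptions, not part of the workshop data, and whitespace splitting stands in for the real tokenizer.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf: ln((1 + n) / (1 + df)) + 1
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        # L2-normalize the row (guard against empty documents)
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

# Hypothetical toy corpus
toy_docs = ["a man walks", "a man runs", "a dog barks"]
vectors = tfidf_vectors(toy_docs)
for doc, vec in zip(toy_docs, vectors):
    print(doc, {term: round(w, 3) for term, w in vec.items()})
```

In the first toy document, "walks" (present in one document) receives a higher weight than "man" (two documents), which in turn outweighs "a" (all three documents).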

**3. Run queries:**

- Write a function that computes the similarity between a query and the documents using the TF-IDF matrix

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

def calcular_similitud(query, tfidf_vectorizer, tfidf_matrix, titles):
    # Clean the query with the same preprocessing logic
    # (the vectorizer applies clean_text again, which is harmless because it is idempotent)
    query_cleaned = clean_text(query)

    # Vectorize the query
    query_tfidf = tfidf_vectorizer.transform([query_cleaned])

    # Compute the cosine similarity between the query and every document
    similitudes = cosine_similarity(query_tfidf, tfidf_matrix).flatten()

    # Build a DataFrame with the results
    resultados = pd.DataFrame({
        'Title': titles,
        'Similarity': similitudes
    })

    # Sort the documents by similarity in descending order
    resultados_ordenados = resultados.sort_values(by='Similarity', ascending=False)

    return resultados_ordenados

# Example usage
consulta = "man"
resultados = calcular_similitud(consulta, tfidf_vectorizer, tfidf_matrix, df_filtered['Title'])

# Show the results
print(resultados.head(10))

                                 Title  Similarity
18255                 The Medicine Man    0.452338
18965                Death Is a Number    0.440332
20811       My Wrongs #8245–8249 & 117    0.392925
22406                        King Dave    0.335430
18390            The Man in the Mirror    0.331631
1942                  Murder in Harlem    0.330803
15734                         The Road    0.324203
16472  Cheech & Chong's Animated Movie    0.316461
23866              Everyday I Love You    0.314217
27894                    Junior Senior    0.307387
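The ranking above rests on cosine similarity between the query vector and each document row. The sketch below computes it by hand on sparse vectors stored as dictionaries; the function `cosine` and the toy vectors are assumptions made for illustration. Because a one-word query has a single non-zero component and TF-IDF rows are L2-normalized, the score reduces to the document's normalized weight for that word, which is why plots dominated by "man" rank first.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_u * norm_v)

# Hypothetical, already-normalized toy vectors
query_vec = {"man": 1.0}
doc_vecs = {
    "doc_a": {"man": 0.6, "medicine": 0.8},
    "doc_b": {"dog": 1.0},
}

# Rank document names by similarity to the query, best first
ranking = sorted(doc_vecs, key=lambda name: cosine(query_vec, doc_vecs[name]), reverse=True)
print(ranking)
```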


**4. Evaluate the results:**

- Record the retrieved documents and analyze their relevance

In [None]:
import os

def registrar_documentos(resultados, output_path="resultados_recuperados.csv", threshold=0.5):
    # Make sure the output directory exists (skip when the path has no directory part,
    # because os.makedirs('') raises an error)
    output_dir = os.path.dirname(output_path)
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)

    # Flag each document as relevant or not according to the threshold
    resultados['Relevancia'] = resultados['Similarity'] >= threshold

    # Save the results to a CSV file
    resultados.to_csv(output_path, index=False)
    print(f"Results saved to: {output_path}")

    # Keep only the relevant documents
    resultados_filtrados = resultados[resultados['Relevancia']]
    print(f"Relevant documents (similarity >= {threshold}): {len(resultados_filtrados)}")

    return resultados_filtrados

# Call the function with the query results
ruta_salida = os.path.join("data", "resultados_recuperados.csv")
documentos_relevantes = registrar_documentos(resultados, output_path=ruta_salida, threshold=0.4)

# Inspect the relevant results
print("Relevant documents:")
print(documentos_relevantes[['Title', 'Similarity']].head(10))

Results saved to: data\resultados_recuperados.csv
Relevant documents (similarity >= 0.4): 2
Relevant documents:
                   Title  Similarity
18255   The Medicine Man    0.452338
18965  Death Is a Number    0.440332
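A similarity threshold only says which documents scored high, not whether they actually answer the query. One possible way to analyze relevance is precision at k against manual judgments. The sketch below assumes a hypothetical set `judged_relevant` of titles that a reader marked as relevant; those judgments are invented for illustration and are not results from this workshop.

```python
def precision_at_k(retrieved_titles, relevant_titles, k):
    """Fraction of the top-k retrieved titles that were judged relevant."""
    top_k = retrieved_titles[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for title in top_k if title in relevant_titles)
    return hits / len(top_k)

# Titles in ranked order (taken from the query output above)
retrieved = ["The Medicine Man", "Death Is a Number", "King Dave", "The Road"]

# Hypothetical manual relevance judgments
judged_relevant = {"The Medicine Man", "The Road"}

p_at_2 = precision_at_k(retrieved, judged_relevant, 2)
p_at_4 = precision_at_k(retrieved, judged_relevant, 4)
print(p_at_2, p_at_4)
```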


### Part 2: Retrieval with BM25

**1. Configure Elasticsearch:**

- Reuse the index created in Exercise 1 to run BM25-based queries

**2. Install Whoosh or Rank-BM25:**

To use BM25 in Python, we can use rank_bm25:

In [16]:
pip install rank-bm25

Collecting rank-bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl.metadata (3.2 kB)
Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank-bm25
Successfully installed rank-bm25-0.2.2
Note: you may need to restart the kernel to use updated packages.




**3. Implement retrieval with BM25:**

In [17]:
from rank_bm25 import BM25Okapi
import nltk
from nltk.tokenize import word_tokenize

# Make sure nltk has the resources it needs
nltk.download('punkt')

# Note: `documents` and `metadatas` are defined in Part 4 (cells In [6] and In [10]),
# which were executed before this cell

# Tokenize the documents (turn each text into a list of words)
tokenized_documents = [word_tokenize(doc.lower()) for doc in documents]

# Build the BM25 model
bm25 = BM25Okapi(tokenized_documents)

# Define the query and tokenize it
query = "man"
tokenized_query = word_tokenize(query.lower())

# Get the 3 most relevant documents
scores = bm25.get_scores(tokenized_query)
top_n = 3
top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_n]

# Show the results
print("BM25 query results:")
for i, idx in enumerate(top_indices):
    print(f"Result {i+1}:")
    print(f"Title: {metadatas[idx]['Title']}")
    print(f"Plot: {documents[idx]}")
    print(f"Score: {scores[idx]}")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\diego\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


BM25 query results:
Result 1:
Title: For Her Sake
Plot: The film is a period drama taking place right before the start of the American Civil War. A young Southern girl chooses between two suitors. She chooses the man who goes to fight Stars and Bars of the Confederacy whilst the rejected suitor goes to fight for the Union. During the war, the Confederate soldier is captured and brought before the Union officer who recognizes him as his rival. The Union man is cruel to his rival and tries to break his spirit with harsh treatment. The girl hears of his plight and becomes determined to rescue him. She evades the guards and gives her lover a file to free himself from the bars. Together they flee and are discovered in the final moments of their escape. One of the sentries shoots at the man, but his shot misses and the two flee on horseback.[1] The Union officer is enraged by the escape and tracks the pair to the girl's home just over the Federal line. He sets up guards a
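For reference, the scoring idea behind BM25 can be sketched in a few lines of plain Python. This is the textbook Okapi formula with a non-negative idf variant, not a copy of the `rank_bm25` implementation, so its numbers may differ from those of `BM25Okapi`. The function `bm25_scores`, the parameter values `k1=1.5` and `b=0.75`, and the toy documents are assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with the Okapi BM25 formula."""
    n_docs = len(tokenized_docs)
    avgdl = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency of each term
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    scores = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Non-negative idf variant: ln(1 + (N - n + 0.5) / (n + 0.5))
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

# Hypothetical toy corpus, already tokenized
toy_docs = [
    "a man rescues a girl".split(),
    "the man and the other man fight".split(),
    "a dog barks".split(),
]
toy_scores = bm25_scores(["man"], toy_docs)
print(toy_scores)
```

Unlike TF-IDF with cosine similarity, repeated occurrences of a term add less and less to the score (controlled by `k1`), and longer documents are penalized relative to the average length (controlled by `b`).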

### Part 3: Retrieval with FAISS

**1. Configure FAISS:**

Create a FAISS index and add the previously generated embeddings.

**2. Install FAISS:**

If you do not have it installed yet, install it with:

In [13]:
pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.8.0.post1-cp38-cp38-win_amd64.whl.metadata (3.8 kB)
Downloading faiss_cpu-1.8.0.post1-cp38-cp38-win_amd64.whl (14.6 MB)

**3. Configure FAISS and add the embeddings**

In [14]:
import faiss
import numpy as np

# `embeddings` was generated in Part 4 (cell In [8])
# Embedding dimensionality
embedding_dim = len(embeddings[0])
# Create a flat (exact, exhaustive) FAISS index based on L2 (Euclidean) distance
index = faiss.IndexFlatL2(embedding_dim)

# Convert the embeddings to a float32 NumPy array (required by FAISS) and add them to the index
index.add(np.array(embeddings, dtype=np.float32))

print(f"Added {index.ntotal} vectors to the FAISS index.")

Added 50 vectors to the FAISS index.
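A flat L2 index performs an exhaustive comparison: the query is measured against every stored vector and the closest ones are returned. The pure-Python sketch below mimics that behavior on toy 2-dimensional vectors; `flat_l2_search` is a hypothetical helper, not a FAISS function. It reports squared Euclidean distances, which is also what `IndexFlatL2` returns.

```python
def flat_l2_search(vectors, query, k):
    """Exhaustive nearest-neighbor search using squared L2 distance."""
    scored = [
        (sum((a - b) ** 2 for a, b in zip(vec, query)), i)
        for i, vec in enumerate(vectors)
    ]
    scored.sort()  # smallest distance first
    top = scored[:k]
    distances = [dist for dist, _ in top]
    indices = [i for _, i in top]
    return distances, indices

# Hypothetical toy vectors standing in for embeddings
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
toy_query = [0.9, 0.1]

toy_distances, toy_indices = flat_l2_search(toy_vectors, toy_query, k=2)
print(toy_indices, toy_distances)
```

The cost grows linearly with the number of stored vectors, which is fine for 50 documents but is the reason approximate index types exist for larger collections.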


**4. Run a search in FAISS:**

To find the documents most similar to a query:

In [15]:
# Define the query
query = "man"

# Generate the query embedding (`model` is loaded in Part 4, cell In [7])
query_embedding = model.encode([query]).astype(np.float32)

# Search for the 3 most similar documents
k = 3
distances, indices = index.search(query_embedding, k)

# Show the results (IndexFlatL2 reports squared Euclidean distances; smaller is closer)
print("FAISS query results:")
for i, idx in enumerate(indices[0]):
    print(f"Result {i+1}:")
    print(f"Title: {metadatas[idx]['Title']}")
    print(f"Plot: {documents[idx]}")
    print(f"Distance: {distances[0][i]}")

FAISS query results:
Result 1:
Title: The Black Viper
Plot: A thug accosts a girl as she leaves her workplace but a man rescues her. The thug vows revenge and, with the help of two friends, attacks the girl and her rescuer again as they're going for a walk. This time they succeed in kidnapping the rescuer. He is bound and gagged and taken away in a cart. The girl runs home and gets help from several neighbors. They track the ruffians down to a cabin in the mountains where the gang has trapped their victim and set the cabin on fire. A thug and Rescuer fight on the roof of the house.
Distance: 1.6032384634017944
Result 2:
Title: Petticoat Camp
Plot: Only lasting 15 minutes, it is a light-hearted comedy about the battle between the sexes as several married couples go on a camp-out together. The women soon realize that the men expect them to do perform all of the work while they relax, leading to several comedic situations.
Distance: 1.6491906642913818
Result 3

### Part 4: Retrieval with ChromaDB

**1. Configure ChromaDB:**

Start a ChromaDB database and define the schema with the fields Title, Plot, and Embedding.

**2. Prepare the data**

We select the column that contains the text (for example, Plot) and a column to serve as unique identifiers (ID or similar). If there is no ID column, we can generate one.

In [5]:
df

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot
0,1901,Kansas Saloon Smashers,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Kansas_Saloon_Sm...,"A bartender is working at a saloon, serving dr..."
1,1901,Love by the Light of the Moon,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Love_by_the_Ligh...,"The moon, painted with a smiling face hangs ov..."
2,1901,The Martyred Presidents,American,Unknown,,unknown,https://en.wikipedia.org/wiki/The_Martyred_Pre...,"The film, just over a minute long, is composed..."
3,1901,"Terrible Teddy, the Grizzly King",American,Unknown,,unknown,"https://en.wikipedia.org/wiki/Terrible_Teddy,_...",Lasting just 61 seconds and consisting of two ...
4,1902,Jack and the Beanstalk,American,"George S. Fleming, Edwin S. Porter",,unknown,https://en.wikipedia.org/wiki/Jack_and_the_Bea...,The earliest known adaptation of the classic f...
...,...,...,...,...,...,...,...,...
34881,2014,The Water Diviner,Turkish,Director: Russell Crowe,Director: Russell Crowe\r\nCast: Russell Crowe...,unknown,https://en.wikipedia.org/wiki/The_Water_Diviner,"The film begins in 1919, just after World War ..."
34882,2017,Çalgı Çengi İkimiz,Turkish,Selçuk Aydemir,"Ahmet Kural, Murat Cemcir",comedy,https://en.wikipedia.org/wiki/%C3%87alg%C4%B1_...,"Two musicians, Salih and Gürkan, described the..."
34883,2017,Olanlar Oldu,Turkish,Hakan Algül,"Ata Demirer, Tuvana Türkay, Ülkü Duru",comedy,https://en.wikipedia.org/wiki/Olanlar_Oldu,"Zafer, a sailor living with his mother Döndü i..."
34884,2017,Non-Transferable,Turkish,Brendan Bradley,"YouTubers Shanna Malcolm, Shira Lazar, Sara Fl...",romantic comedy,https://en.wikipedia.org/wiki/Non-Transferable...,The film centres around a young woman named Am...


In [6]:
# Select the columns of interest
documents = df['Plot'].tolist()  # Main text of each movie
ids = df.index.tolist()  # Use the DataFrame index as unique IDs
metadatas = df[['Title']].to_dict(orient='records')  # Relevant metadata
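If the data had no usable index, the identifiers could be generated explicitly. The sketch below builds one string ID per document from its position; the helper `make_ids` and the `movie` prefix are hypothetical choices. String IDs are used because the notebook later converts its IDs to strings before adding them to the collection.

```python
def make_ids(n_items, prefix="movie"):
    """Generate one unique string ID per item, based on its position."""
    return [f"{prefix}-{i}" for i in range(n_items)]

# Hypothetical toy data standing in for the plots
toy_plots = ["plot a", "plot b", "plot c"]
toy_ids = make_ids(len(toy_plots))

# Every document must get exactly one ID, with no duplicates
assert len(toy_ids) == len(toy_plots)
assert len(set(toy_ids)) == len(toy_ids)
print(toy_ids)
```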

**3. Generate embeddings**

We use SentenceTransformer to convert each document into a numeric vector (embedding).

In [7]:
from sentence_transformers import SentenceTransformer

# Load the embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')


In [8]:
# Number of documents to process, limited because of the available resources
num_docs = 50

# Generate embeddings only for the first `num_docs` documents
embeddings = model.encode(documents[:num_docs]).tolist()

**4. Configure and initialize ChromaDB**

We initialize the ChromaDB client to create or load the collection.

In [9]:
import chromadb

# Initialize the vector database with persistence
client = chromadb.PersistentClient(path="./chroma_db")  # Stores the data on disk

# Create or load the collection
collection = client.get_or_create_collection("wiki_movies_collection")

**5. Add the documents to the collection**

We add the texts (documents), their metadata, IDs, and embeddings to the database.

In [10]:
# Make sure all lists have the same length
documents = documents[:num_docs]
metadatas = metadatas[:num_docs]
ids = ids[:num_docs]

# Check the lengths
print(f"Documents: {len(documents)}")
print(f"Metadata: {len(metadatas)}")
print(f"IDs: {len(ids)}")
print(f"Embeddings: {len(embeddings)}")

Documents: 50
Metadata: 50
IDs: 50
Embeddings: 50


In [11]:
# Convert all IDs to strings (avoid shadowing the built-in `id`)
ids = [str(doc_id) for doc_id in ids]

# Add the documents, metadata, IDs, and embeddings to the collection
collection.add(
    documents=documents,
    metadatas=metadatas,
    ids=ids,
    embeddings=embeddings
)

print("Documents added to the collection successfully.")

Documents added to the collection successfully.


**6. Run the query**

We convert the query "man" into an embedding and search for the most similar documents.
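A note on the shape of the response: `collection.query` returns, for each field, one inner list per query embedding, so the hits for a single query sit at position 0 of each field. The sketch below unpacks that shape hit by hit. It uses a hand-written dictionary, `mock_results`, in place of a real response; its values are placeholders rather than actual output, the helper `iter_hits` is hypothetical, and the presence of the `distances` field is assumed.

```python
# Placeholder response with the nested-list layout (one inner list per query)
mock_results = {
    "ids": [["7", "12"]],
    "documents": [["first plot text", "second plot text"]],
    "metadatas": [[{"Title": "First Title"}, {"Title": "Second Title"}]],
    "distances": [[0.8, 0.9]],
}

def iter_hits(results, query_index=0):
    """Yield (rank, id, title, document, distance) for the hits of one query."""
    rows = zip(
        results["ids"][query_index],
        results["metadatas"][query_index],
        results["documents"][query_index],
        results["distances"][query_index],
    )
    for rank, (doc_id, meta, doc, dist) in enumerate(rows, start=1):
        yield rank, doc_id, meta["Title"], doc, dist

hits = list(iter_hits(mock_results))
for rank, doc_id, title, doc, dist in hits:
    print(f"Result {rank}: {title} (id={doc_id}, distance={dist})")
```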

In [12]:
# Query
query = "man"

# Generate the query embedding
query_embedding = model.encode([query]).tolist()

# Run the query
results = collection.query(
    query_embeddings=query_embedding,
    n_results=3  # Return the 3 most similar documents
)

# Print the results
# Note: results['documents'] holds one inner list per query embedding, so with a
# single query this loop runs once and prints the 3 plots together as one list
print("Query results:")
for i, result in enumerate(results['documents']):
    print(f"Result {i+1}:")
    print(result)
    print("Metadata:", results['metadatas'][i])

Query results:
Result 1:
["A thug accosts a girl as she leaves her workplace but a man rescues her. The thug vows revenge and, with the help of two friends, attacks the girl and her rescuer again as they're going for a walk. This time they succeed in kidnapping the rescuer. He is bound and gagged and taken away in a cart. The girl runs home and gets help from several neighbors. They track the ruffians down to a cabin in the mountains where the gang has trapped their victim and set the cabin on fire. A thug and Rescuer fight on the roof of the house.", 'Only lasting 15 minutes, it is a light-hearted comedy about the battle between the sexes as several married couples go on a camp-out together. The women soon realize that the men expect them to do perform all of the work while they relax, leading to several comedic situations.', "Before heading out to a baseball game at a nearby ballpark, sports fan Mr. Brown drinks several highball cocktails. He arrives at the ballpark to