# Bimonthly Exam – Design of a Basic Information Retrieval System

## Name: Wilmer Rivas

## 1. Download the Dataset
We download the Rotten Tomatoes movies and critic reviews dataset from Kaggle with `kagglehub` for later analysis.

In [2]:
import kagglehub

# Download the latest version of the dataset
path = kagglehub.dataset_download("stefanoleone992/rotten-tomatoes-movies-and-critic-reviews-dataset")
print("Path to dataset files:", path)

Path to dataset files: /kaggle/input/rotten-tomatoes-movies-and-critic-reviews-dataset


## 2. Load and Explore the Data
We read the two CSV files (movies and critic reviews) and print the first rows of each to understand their structure.


In [3]:
import pandas as pd

# Define the file paths
movies_path = "/kaggle/input/rotten-tomatoes-movies-and-critic-reviews-dataset/rotten_tomatoes_movies.csv"
reviews_path = "/kaggle/input/rotten-tomatoes-movies-and-critic-reviews-dataset/rotten_tomatoes_critic_reviews.csv"

# Load the data
movies_df = pd.read_csv(movies_path)
reviews_df = pd.read_csv(reviews_path)

# Initial inspection
print(movies_df.head())
print(reviews_df.head())

                    rotten_tomatoes_link  \
0                              m/0814255   
1                              m/0878835   
2                                   m/10   
3                 m/1000013-12_angry_men   
4  m/1000079-20000_leagues_under_the_sea   

                                         movie_title  \
0  Percy Jackson & the Olympians: The Lightning T...   
1                                        Please Give   
2                                                 10   
3                    12 Angry Men (Twelve Angry Men)   
4                       20,000 Leagues Under The Sea   

                                          movie_info  \
0  Always trouble-prone, the life of teenager Per...   
1  Kate (Catherine Keener) and her husband Alex (...   
2  A successful, middle-aged Hollywood songwriter...   
3  Following the closing arguments in a murder tr...   
4  In 1866, Professor Pierre M. Aronnax (Paul Luk...   

                                   critics_consensus content_

## 3. Data Preprocessing
We clean the review text by converting it to lowercase, removing every character other than letters and whitespace, dropping stopwords, and applying stemming to normalize the words.
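
As a quick illustration of the cleaning steps described above, here is a minimal sketch that uses only the standard library. It deliberately leaves out stemming so that it runs without NLTK, and the names `SAMPLE_STOPWORDS` and `clean_text` are illustrative assumptions, not part of the notebook.

```python
import re

# Small illustrative stopword list (assumption: a subset chosen for this demo only)
SAMPLE_STOPWORDS = {"a", "the", "of", "and", "is", "in", "to"}

def clean_text(text):
    """Lowercase, drop every character that is not a-z or whitespace, remove stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)
    return [word for word in text.split() if word not in SAMPLE_STOPWORDS]

tokens = clean_text("A top-notch cast and 3 dazzling effects!")
print(tokens)  # ['topnotch', 'cast', 'dazzling', 'effects']
```

Note that deleting punctuation without inserting a space fuses hyphenated words ("top-notch" becomes "topnotch"), which is the same behavior visible in the processed reviews printed by the notebook.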


In [44]:
import re
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import PorterStemmer

# Define the stopword list manually
stop_words = set([
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has", "he", 
    "in", "is", "it", "its", "of", "on", "that", "the", "to", "was", "were", 
    "will", "with", "this", "i", "you", "but", "not", "or", "if", "then", "so"
])

# Initialize the stemmer
stemmer = PorterStemmer()

# Text preprocessing function
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove special characters and digits (keep only a-z and whitespace)
    text = re.sub(r"[^a-z\s]", "", text)
    # Tokenize, remove stopwords, and apply stemming
    tokens = [stemmer.stem(word) for word in text.split() if word not in stop_words]
    # Join the processed tokens back into a single string
    return " ".join(tokens)

# Load the data
movies_path = "/kaggle/input/rotten-tomatoes-movies-and-critic-reviews-dataset/rotten_tomatoes_movies.csv"
reviews_path = "/kaggle/input/rotten-tomatoes-movies-and-critic-reviews-dataset/rotten_tomatoes_critic_reviews.csv"

movies_df = pd.read_csv(movies_path)
reviews_df = pd.read_csv(reviews_path)

# Drop reviews with no text, then preprocess the review column
print("Null values in 'review_content':", reviews_df['review_content'].isnull().sum())
# .copy() avoids pandas' SettingWithCopyWarning when the new column is added
reviews_df = reviews_df[reviews_df['review_content'].notnull()].copy()
reviews_df['processed_review'] = reviews_df['review_content'].apply(preprocess_text)

# Build the TF-IDF matrix
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(reviews_df['processed_review'])

# Link reviews to movie titles
reviews_with_titles = reviews_df.merge(movies_df[['rotten_tomatoes_link', 'movie_title']], 
                                       on='rotten_tomatoes_link', how='left')

# Show the first rows to confirm
print(reviews_with_titles[['movie_title', 'processed_review']].head())

Null values in 'review_content': 65806
                                         movie_title  \
0  Percy Jackson & the Olympians: The Lightning T...   
1  Percy Jackson & the Olympians: The Lightning T...   
2  Percy Jackson & the Olympians: The Lightning T...   
3  Percy Jackson & the Olympians: The Lightning T...   
4  Percy Jackson & the Olympians: The Lightning T...   

                                    processed_review  
0  fantasi adventur fuse greek mytholog contempor...  
1  uma thurman medusa gorgon coiffur writh snake ...  
2  topnotch cast dazzl special effect tide teen o...  
3  whether audienc get behind lightn thief hard p...  
4  what realli lack lightn thief genuin sens wond...  


## 4. Build the TF-IDF Matrix
We transform the preprocessed reviews into a TF-IDF feature matrix, with one row per review and one column per vocabulary term.
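
To make the weighting concrete, the following sketch computes TF-IDF by hand for a toy corpus. It assumes the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which to my understanding corresponds to the default settings of scikit-learn's `TfidfVectorizer`; the toy documents and the function name `tfidf_vectors` are illustrative and not taken from the notebook.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (assumed to mirror sklearn's defaults)
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        weights = {term: freq * idf[term] for term, freq in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()} if norm else {})
    return vectors

# Toy corpus of already-stemmed text (hypothetical examples)
toy_docs = ["scari horror film", "funni comedi film", "horror classic"]
toy_vectors = tfidf_vectors(toy_docs)
for vec in toy_vectors:
    print({term: round(weight, 3) for term, weight in vec.items()})
```

Terms that occur in fewer documents (such as "scari") receive a larger weight than terms shared across documents (such as "film"), which is what lets distinctive query words dominate the ranking.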


In [45]:
# Create and fit the TF-IDF matrix on the preprocessed reviews
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(reviews_df['processed_review'])

# Confirm that TF-IDF has been fitted
print("TF-IDF fitted successfully.")


TF-IDF fitted successfully.


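## 5. Search for Relevant Movies
We use cosine similarity between the query vector and every review vector to find the movies most relevant to a query. Results are ranked per review, so the same film can appear more than once.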
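
The ranking idea can be sketched without any libraries. The example below computes cosine similarity between sparse vectors stored as dictionaries; the vectors, the review names, and the helper `cosine` are hypothetical and only stand in for the TF-IDF rows the notebook builds.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector shares no terms with anything
    return dot / (norm_u * norm_v)

# Hypothetical query and review vectors (stemmed terms, made-up weights)
query = {"horror": 1.0, "movi": 1.0}
reviews = {
    "review_a": {"horror": 0.8, "movi": 0.6},
    "review_b": {"comedi": 0.9, "movi": 0.4},
    "review_c": {"drama": 1.0},
}
ranked = sorted(reviews, key=lambda name: cosine(query, reviews[name]), reverse=True)
print(ranked)  # ['review_a', 'review_b', 'review_c']
```

Because cosine similarity ignores vector length, a very short review that contains little beyond the query terms can outrank a longer, more detailed one.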


In [68]:
from sklearn.metrics.pairwise import cosine_similarity

# Link reviews to movie titles and genres
reviews_with_titles = reviews_df.merge(
    movies_df[['rotten_tomatoes_link', 'movie_title', 'genres']],
    on='rotten_tomatoes_link',
    how='left'
)

def search_reviews(query, tfidf_matrix, vectorizer, reviews_with_titles):
    # Preprocess the query with the same pipeline used for the reviews
    query_processed = preprocess_text(query)
    # Vectorize the query
    query_vector = vectorizer.transform([query_processed])
    # Compute cosine similarity between the query and every review
    similarities = cosine_similarity(query_vector, tfidf_matrix)
    # Attach the scores to a copy so the caller's DataFrame is not modified
    scored = reviews_with_titles.assign(similarity=similarities[0])
    # Return the 10 highest-scoring reviews with their movie title and genres
    return scored.sort_values(by='similarity', ascending=False).head(10)[['movie_title', 'genres', 'similarity']]

# Test with a sample query
query = "horror movies"
results = search_reviews(query, tfidf_matrix, vectorizer, reviews_with_titles)

# Show the results
print(results.to_string(index=False))



                             movie_title                                                   genres  similarity
                  The Cabin in the Woods                               Horror, Mystery & Suspense    0.870607
                                     8MM                   Action & Adventure, Mystery & Suspense    0.859889
                                  Casper          Drama, Kids & Family, Science Fiction & Fantasy    0.859889
                       Alone in the Dark                               Action & Adventure, Horror    0.859889
                  The Cabin in the Woods                               Horror, Mystery & Suspense    0.752115
Children Shouldn't Play with Dead Things                Comedy, Drama, Horror, Mystery & Suspense    0.721244
                            Dog Soldiers                                                   Horror    0.721244
                                The Blob Classics, Cult Movies, Horror, Science Fiction & Fantasy    0.708537
          

## 6. Evaluate the Results
We compute Precision@K and Recall against a small, manually defined ground truth to measure how effective the search is.
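
Since the search returns one row per review, the same film can show up several times in the top results. The sketch below is one possible variant of the metrics that collapses repeated titles before counting; the function name `precision_recall_at_k` and the toy titles are assumptions made for illustration, not the notebook's data.

```python
def precision_recall_at_k(retrieved_titles, relevant_titles, k=10):
    """Compute Precision@K and Recall@K over unique, case-insensitive titles."""
    relevant = {title.lower().strip() for title in relevant_titles}
    # Keep the first occurrence of each title so repeated reviews of one film count once
    unique_ranked = list(dict.fromkeys(title.lower().strip() for title in retrieved_titles))
    top_k = unique_ranked[:k]
    hits = sum(1 for title in top_k if title in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Toy ranking (hypothetical, not the notebook's output): "Film A" is returned twice
retrieved = ["Film A", "Film B", "film a", "Film C"]
relevant = ["Film A", "Film C", "Film D"]
precision, recall = precision_recall_at_k(retrieved, relevant, k=3)
print(f"Precision@3: {precision:.2f}, Recall@3: {recall:.2f}")  # Precision@3: 0.67, Recall@3: 0.67
```

Counting each film once keeps recall at or below 1.0 and prevents a film with many matching reviews from inflating either metric.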


In [67]:
def evaluate_results(results, ground_truth):
    # Collect the retrieved movie titles and normalize them
    retrieved_movies = [movie.lower().strip() for movie in results['movie_title'].tolist()]
    # Normalize the ground-truth titles
    ground_truth_normalized = [movie.lower().strip() for movie in ground_truth]
    # Count retrieved rows whose title is in the ground truth; a film that
    # appears in several rows is counted once per row
    hits = len([movie for movie in retrieved_movies if movie in ground_truth_normalized])

    # Compute Precision@K
    if len(retrieved_movies) > 0:
        precision_at_k = hits / len(retrieved_movies)
    else:
        precision_at_k = 0.0

    # Compute Recall
    if len(ground_truth_normalized) > 0:
        recall = hits / len(ground_truth_normalized)
    else:
        recall = 0.0
    
    # Print the metrics
    print("\nEvaluation Results:")
    print(f"Precision@K: {precision_at_k:.2f}")
    print(f"Recall: {recall:.2f}")

# Define the ground truth (movies known to be relevant to the query)
ground_truth = ["The Cabin in the Woods", "Insidious", "Halloween", "It", "Sleepy Hollow", "Casper"]

# Evaluate the results
evaluate_results(results, ground_truth)



Evaluation Results:
Precision@K: 0.50
Recall: 0.83
