# Midterm Exam 1 - Exercise 2

**Objective:** Find the 3 films most similar to **"The Matrix Revolutions"** using the file `tmdb_5000_movies.csv`.

Three distance metrics are used:
1. **Jaccard distance** (similarity by genres)
2. **Levenshtein distance** (similarity by synopsis)
3. **Euclidean distance** (similarity by numeric attributes)


In [1]:
import numpy as np
import pandas as pd
import json

# Load the dataset
df = pd.read_csv('tmdb_5000_movies.csv')
print(f"Dataset loaded: {df.shape[0]} movies, {df.shape[1]} columns")
print(f"Columns: {list(df.columns)}")
df.head()


Dataset loaded: 4803 movies, 20 columns
Columns: ['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language', 'original_title', 'overview', 'popularity', 'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title', 'vote_average', 'vote_count']


Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


In [None]:
# Verify that "The Matrix Revolutions" is in the dataset (row 123, 0-based index)
matrix_rev = df[df['title'] == 'The Matrix Revolutions']
print(f"Index of 'The Matrix Revolutions': {matrix_rev.index.tolist()}")
print(f"Genres: {matrix_rev['genres'].values[0]}")
print(f"Overview: {matrix_rev['overview'].values[0]}")
print(f"Budget: {matrix_rev['budget'].values[0]}")
print(f"Revenue: {matrix_rev['revenue'].values[0]}")
print(f"Popularity: {matrix_rev['popularity'].values[0]}")


---
## 1. Similarity by Genres: Jaccard Distance

Each film is represented as the **set of its genres**.

The Jaccard distance between two films $A$ and $B$ is defined as:

$$d_J(A, B) = 1 - \frac{|G_A \cap G_B|}{|G_A \cup G_B|}$$

where $G_A$ and $G_B$ are the genre sets of the two films.
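
As a quick worked instance of the formula, the sketch below evaluates the distance on two small genre sets. The sets `g_a` and `g_b` and the helper name `jaccard_distance_demo` are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
def jaccard_distance_demo(a: set, b: set) -> float:
    """Jaccard distance between two sets; two empty sets count as maximally distant."""
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical genre sets, used only to illustrate the formula.
g_a = {"Action", "Adventure", "Science Fiction", "Thriller"}
g_b = {"Action", "Science Fiction", "Drama"}

# The intersection has 2 genres and the union has 5, so d_J = 1 - 2/5.
d = jaccard_distance_demo(g_a, g_b)
print(round(d, 2))  # 0.6
```

Identical sets give a distance of 0, and sets with no genre in common give 1.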


In [2]:
# Function to extract genres from the JSON field
def extract_genres(genres_str):
    """Extract the genre names from the CSV's JSON field."""
    try:
        # The field is already valid (double-quoted) JSON, so it is parsed as-is
        genres_list = json.loads(genres_str)
        return set(g['name'] for g in genres_list)
    except (TypeError, ValueError, KeyError):
        # Missing or malformed field: treat the film as having no genres
        return set()

# Extract genres for every film
df['genres_set'] = df['genres'].apply(extract_genres)

# Check the genres of The Matrix Revolutions
matrix_idx = 123  # Index of The Matrix Revolutions
genres_matrix = df.loc[matrix_idx, 'genres_set']
print(f"Genres of 'The Matrix Revolutions': {genres_matrix}")


Genres of 'The Matrix Revolutions': {'Adventure', 'Thriller', 'Action', 'Science Fiction'}


In [3]:
# Jaccard distance function
def jaccard_distance(set_a, set_b):
    """
    Compute the Jaccard distance between two sets.
    d_J(A, B) = 1 - |A ∩ B| / |A ∪ B|
    If both sets are empty, return 1 (maximum distance).
    """
    union = set_a | set_b
    if len(union) == 0:
        return 1.0
    intersection = set_a & set_b
    return 1.0 - len(intersection) / len(union)

# Compute the Jaccard distance between The Matrix Revolutions and every other film
jaccard_distances = []
for idx, row in df.iterrows():
    if idx == matrix_idx:
        jaccard_distances.append(np.nan)  # Do not compare the film with itself
    else:
        d = jaccard_distance(genres_matrix, row['genres_set'])
        jaccard_distances.append(d)

df['jaccard_dist'] = jaccard_distances

# Top 10 most similar films by genre (smallest Jaccard distance).
# Ties are broken by row order: nsmallest keeps the first occurrences by default.
top10_jaccard = df.dropna(subset=['jaccard_dist']).nsmallest(10, 'jaccard_dist')[['title', 'genres_set', 'jaccard_dist']]
print("=" * 80)
print("TOP 10 films most similar to 'The Matrix Revolutions' by GENRES (Jaccard)")
print("=" * 80)
for rank, (idx, row) in enumerate(top10_jaccard.iterrows(), 1):
    print(f"{rank:2d}. {row['title']:<45s} d_J = {row['jaccard_dist']:.4f}  Genres: {row['genres_set']}")


TOP 10 films most similar to 'The Matrix Revolutions' by GENRES (Jaccard)
 1. Battleship                                    d_J = 0.0000  Genres: {'Adventure', 'Thriller', 'Action', 'Science Fiction'}
 2. Jurassic World                                d_J = 0.0000  Genres: {'Adventure', 'Thriller', 'Action', 'Science Fiction'}
 3. X-Men: The Last Stand                         d_J = 0.0000  Genres: {'Adventure', 'Thriller', 'Action', 'Science Fiction'}
 4. Green Lantern                                 d_J = 0.0000  Genres: {'Adventure', 'Thriller', 'Action', 'Science Fiction'}
 5. G.I. Joe: The Rise of Cobra                   d_J = 0.0000  Genres: {'Adventure', 'Thriller', 'Action', 'Science Fiction'}
 6. Terminator Genisys                            d_J = 0.0000  Genres: {'Adventure', 'Thriller', 'Action', 'Science Fiction'}
 7. X-Men Origins: Wolverine                      d_J = 0.0000  Genres: {'Adventure', 'Thriller', 'Action', 'Science Fiction'}
 8. The Matrix

---
## 2. Similarity by Synopsis: Levenshtein Distance

The `overview` field is used as the text. The **Levenshtein distance** $d_L(s_1, s_2)$ is the minimum number of edits (insertions, deletions, substitutions) needed to transform $s_1$ into $s_2$.

**Note:** To avoid excessive computational cost, the distance is computed on a subset of 300 films and on a snippet of roughly the first 150 characters of each overview.
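
To give a rough sense of why the subsample helps, the sketch below counts the inner-loop updates of the standard dynamic-programming algorithm, which fills about `len(s1) * len(s2)` cells per pair. It is a back-of-the-envelope estimate that assumes every snippet is exactly 150 characters long (shorter overviews would cost less); the variable names are illustrative.

```python
snippet_len = 150   # characters kept from each overview (upper bound)
n_films = 4803      # rows in tmdb_5000_movies.csv
sample_size = 300   # films in the subsample, including the query film

# The DP inner loop touches one cell per (character of s1, character of s2) pair.
cells_per_pair = snippet_len * snippet_len

# Comparisons against every other film vs. against the subsample only.
full_cost = (n_films - 1) * cells_per_pair
sample_cost = (sample_size - 1) * cells_per_pair

print(f"full dataset: {full_cost:,} cell updates")
print(f"subsample:    {sample_cost:,} cell updates")
print(f"ratio:        {full_cost / sample_cost:.1f}x")
```

Each cell update runs in interpreted Python, so the count translates fairly directly into running time.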


In [4]:
# Implementation of the Levenshtein distance (dynamic programming)
def levenshtein_distance(s1, s2):
    """
    Compute the Levenshtein distance between two strings s1 and s2.
    It is the minimum number of operations (insert, delete, substitute)
    needed to transform s1 into s2.
    """
    m, n = len(s1), len(s2)

    # Create the distance matrix
    dp = np.zeros((m + 1, n + 1), dtype=int)

    # Base cases
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j

    # Fill in the matrix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                cost = 0
            else:
                cost = 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,       # Deletion
                dp[i][j - 1] + 1,       # Insertion
                dp[i - 1][j - 1] + cost  # Substitution
            )

    return int(dp[m][n])

# Verificar con un ejemplo simple
print(f"Levenshtein('kitten', 'sitting') = {levenshtein_distance('kitten', 'sitting')}")
print(f"Levenshtein('abc', 'abc') = {levenshtein_distance('abc', 'abc')}")


Levenshtein('kitten', 'sitting') = 3
Levenshtein('abc', 'abc') = 0
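
The full matrix is not strictly needed: each row depends only on the row above it. The following is a sketch of a two-row variant in plain Python, offered as an alternative under that observation rather than as the notebook's method; `levenshtein_two_rows` is a name introduced here for illustration.

```python
def levenshtein_two_rows(s1: str, s2: str) -> int:
    """Levenshtein distance that keeps only two rows of the DP table."""
    if len(s1) < len(s2):
        # The distance is symmetric, so put the shorter string along the row.
        s1, s2 = s2, s1
    previous = list(range(len(s2) + 1))  # distances from "" to prefixes of s2
    for i, ch1 in enumerate(s1, start=1):
        current = [i]  # distance from s1[:i] to ""
        for j, ch2 in enumerate(s2, start=1):
            cost = 0 if ch1 == ch2 else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


print(levenshtein_two_rows("kitten", "sitting"))  # 3
print(levenshtein_two_rows("abc", "abc"))         # 0
```

Memory drops from O(m·n) to O(min(m, n)); the number of cell updates is unchanged.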


In [5]:
# Take a random sample of 300 films, making sure The Matrix Revolutions is included
np.random.seed(42)

# Build the 300-film subset
other_indices = df.index[df.index != matrix_idx].tolist()
sample_indices = list(np.random.choice(other_indices, size=299, replace=False))
sample_indices.append(matrix_idx)
df_sample = df.loc[sample_indices].copy()

print(f"Sample size: {len(df_sample)}")
print(f"'The Matrix Revolutions' in the sample: {matrix_idx in df_sample.index}")


Sample size: 300
'The Matrix Revolutions' in the sample: True


In [6]:
# Get the snippet (~150 characters) of the overview of The Matrix Revolutions
SNIPPET_LEN = 150

overview_matrix = str(df.loc[matrix_idx, 'overview'])[:SNIPPET_LEN]
print(f"Snippet of 'The Matrix Revolutions' ({len(overview_matrix)} chars):")
print(f"  '{overview_matrix}'")
print()

# Compute the Levenshtein distance between The Matrix Revolutions and the other films in the sample
lev_distances = []
for idx, row in df_sample.iterrows():
    if idx == matrix_idx:
        lev_distances.append(np.nan)
    else:
        # Note: a missing overview (NaN) becomes the string 'nan' here
        overview_other = str(row['overview'])[:SNIPPET_LEN]
        d = levenshtein_distance(overview_matrix, overview_other)
        lev_distances.append(d)

df_sample['levenshtein_dist'] = lev_distances

# Top 10 most similar films by synopsis (smallest Levenshtein distance)
top10_lev = df_sample.dropna(subset=['levenshtein_dist']).nsmallest(10, 'levenshtein_dist')[['title', 'overview', 'levenshtein_dist']]
print("=" * 80)
print("TOP 10 films most similar to 'The Matrix Revolutions' by SYNOPSIS (Levenshtein)")
print("=" * 80)
for rank, (idx, row) in enumerate(top10_lev.iterrows(), 1):
    snippet = str(row['overview'])[:80] + '...'
    print(f"{rank:2d}. {row['title']:<45s} d_L = {int(row['levenshtein_dist']):4d}  Overview: {snippet}")


Snippet of 'The Matrix Revolutions' (150 chars):
  'The human city of Zion defends itself against the massive invasion of the machines as Neo fights to end the war at another front while also opposing t'

TOP 10 films most similar to 'The Matrix Revolutions' by SYNOPSIS (Levenshtein)
 1. Zoolander 2                                   d_L =  102  Overview: Derek and Hansel are modelling again when an opposing company attempts to take t...
 2. Bats                                          d_L =  106  Overview: Genetically mutated bats escape and it's up to a bat expert and the local sherif...
 3. A Walk on the Moon                            d_L =  107  Overview: The world of a young housewife is turned upside down when she has an affair with...
 4. Diary of a Wimpy Kid: Dog Days                d_L =  107  Overview: School is out and Greg is ready for the days of summer, when all his plans go wr...
 5. The Hunger Games: Mockingjay - Part 1         d_L =  107  Overview: Katniss Ever

---
## 3. Similarity by Numeric Attributes: Euclidean Distance

A numeric vector is built for each film from the variables **budget**, **revenue**, and **popularity**.

Each variable is standardized with:

$$z = \frac{x - \mu_x}{\sigma_x}$$

The Euclidean distance is then computed on the standardized vectors:

$$d_E(x, y) = \sqrt{\sum_j (x_j - y_j)^2}$$
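
The two formulas above can also be applied to all films at once with NumPy broadcasting. The sketch below uses a small hypothetical array `X` (rows are films; columns are budget, revenue, popularity) rather than the real dataset, and takes row 0 as the query film.

```python
import numpy as np

# Hypothetical toy data: one row per film, columns = (budget, revenue, popularity).
X = np.array([
    [150e6, 400e6,  70.0],
    [ 30e6,  80e6,  20.0],
    [200e6, 900e6, 120.0],
    [  5e6,  10e6,   5.0],
])

# Standardize each column. ddof=1 gives the sample standard deviation,
# which is what pandas' Series.std() computes by default.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Euclidean distance from the query (row 0) to every row, via broadcasting.
dists = np.sqrt(((Z - Z[0]) ** 2).sum(axis=1))

# Exclude the query itself before ranking, then sort by increasing distance.
masked = dists.copy()
masked[0] = np.inf
order = np.argsort(masked)
print(order[:3])
```

Without standardization, revenue and budget (in the hundreds of millions) would dominate the sum and popularity would have almost no effect on the distance.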


In [7]:
# Select the numeric variables
numeric_cols = ['budget', 'revenue', 'popularity']

# Standardize each variable: z = (x - mu) / sigma
df_numeric = df[numeric_cols].copy()

for col in numeric_cols:
    mu = df_numeric[col].mean()
    sigma = df_numeric[col].std()
    df_numeric[col + '_z'] = (df_numeric[col] - mu) / sigma
    print(f"{col:12s}: μ = {mu:>15.2f}, σ = {sigma:>15.2f}")

z_cols = [col + '_z' for col in numeric_cols]

# Show the standardized vector of The Matrix Revolutions
print(f"\nStandardized vector of 'The Matrix Revolutions':")
print(df_numeric.loc[matrix_idx, z_cols])


budget      : μ =     29045039.88, σ =     40722391.26
revenue     : μ =     82260638.65, σ =    162857100.94
popularity  : μ =           21.49, σ =           31.82

Standardized vector of 'The Matrix Revolutions':
budget_z        2.970232
revenue_z       2.104468
popularity_z    1.628758
Name: 123, dtype: float64


In [8]:
# Euclidean distance function
def euclidean_distance(vec_a, vec_b):
    """
    Compute the Euclidean distance between two vectors.
    d_E(x, y) = sqrt(sum((x_j - y_j)^2))
    """
    return np.sqrt(np.sum((vec_a - vec_b) ** 2))

# Standardized vector of The Matrix Revolutions
vec_matrix = df_numeric.loc[matrix_idx, z_cols].values.astype(float)

# Compute the Euclidean distance between The Matrix Revolutions and every other film
euclid_distances = []
for idx in df.index:
    if idx == matrix_idx:
        euclid_distances.append(np.nan)
    else:
        vec_other = df_numeric.loc[idx, z_cols].values.astype(float)
        d = euclidean_distance(vec_matrix, vec_other)
        euclid_distances.append(d)

df['euclidean_dist'] = euclid_distances

# Top 3 most similar films by numeric attributes (smallest Euclidean distance)
top3_euclid = df.dropna(subset=['euclidean_dist']).nsmallest(3, 'euclidean_dist')[['title', 'budget', 'revenue', 'popularity', 'euclidean_dist']]
print("=" * 80)
print("TOP 3 films most similar to 'The Matrix Revolutions' by NUMERIC ATTRIBUTES (Euclidean)")
print("=" * 80)
for rank, (idx, row) in enumerate(top3_euclid.iterrows(), 1):
    print(f"{rank}. {row['title']:<45s} d_E = {row['euclidean_dist']:.4f}")
    print(f"   Budget: ${row['budget']:,.0f}  Revenue: ${row['revenue']:,.0f}  Popularity: {row['popularity']:.2f}")
    print()


TOP 3 films most similar to 'The Matrix Revolutions' by NUMERIC ATTRIBUTES (Euclidean)
1. Star Trek                                     d_E = 0.2416
   Budget: $150,000,000  Revenue: $385,680,446  Popularity: 73.62

2. Night at the Museum: Battle of the Smithsonian d_E = 0.2760
   Budget: $150,000,000  Revenue: $413,106,170  Popularity: 81.78

3. Mission: Impossible III                       d_E = 0.3623
   Budget: $150,000,000  Revenue: $397,850,012  Popularity: 63.08



---
## Summary: The 3 most similar films according to each metric


In [9]:
# Comparative summary
print("=" * 80)
print("SUMMARY: The 3 films most similar to 'The Matrix Revolutions'")
print("=" * 80)

print("\n📌 By GENRES (Jaccard):")
top3_jaccard = df.dropna(subset=['jaccard_dist']).nsmallest(3, 'jaccard_dist')
for rank, (idx, row) in enumerate(top3_jaccard.iterrows(), 1):
    print(f"   {rank}. {row['title']:<45s} d_J = {row['jaccard_dist']:.4f}")

print("\n📌 By SYNOPSIS (Levenshtein):")
top3_lev = df_sample.dropna(subset=['levenshtein_dist']).nsmallest(3, 'levenshtein_dist')
for rank, (idx, row) in enumerate(top3_lev.iterrows(), 1):
    print(f"   {rank}. {row['title']:<45s} d_L = {int(row['levenshtein_dist'])}")

print("\n📌 By NUMERIC ATTRIBUTES (Euclidean):")
for rank, (idx, row) in enumerate(top3_euclid.iterrows(), 1):
    print(f"   {rank}. {row['title']:<45s} d_E = {row['euclidean_dist']:.4f}")


SUMMARY: The 3 films most similar to 'The Matrix Revolutions'

📌 By GENRES (Jaccard):
   1. Battleship                                    d_J = 0.0000
   2. Jurassic World                                d_J = 0.0000
   3. X-Men: The Last Stand                         d_J = 0.0000

📌 By SYNOPSIS (Levenshtein):
   1. Zoolander 2                                   d_L = 102
   2. Bats                                          d_L = 106
   3. A Walk on the Moon                            d_L = 107

📌 By NUMERIC ATTRIBUTES (Euclidean):
   1. Star Trek                                     d_E = 0.2416
   2. Night at the Museum: Battle of the Smithsonian d_E = 0.2760
   3. Mission: Impossible III                       d_E = 0.3623


---
## Analysis: Why do the results differ?

### Why might the results differ among Jaccard, Levenshtein, and Euclidean?

Each metric captures a **different aspect** of similarity between films:

- **Jaccard (genres):** Measures **thematic/categorical** similarity. Two films are similar if they share the same genres (e.g., Action, Science Fiction). It is a **discrete**, set-based measure that ignores the intensity and details of the plot and looks only at the general classification. Because many films share an identical genre set (at least seven have $d_J = 0$ here), the top 3 among ties is decided by row order.

- **Levenshtein (synopsis):** Measures **character-level textual** similarity between the descriptions. It rewards matching characters in matching positions rather than shared meaning, so it is very sensitive to wording: two films with similar plots described in different words will have a high distance. Here even the best match is 102 edits away on snippets of at most 150 characters, so the ranking reflects surface form more than narrative.

- **Euclidean (numeric attributes):** Measures similarity in terms of **production scale and commercial success** (budget, revenue, popularity). Films with similar budgets and box-office revenue will be close, regardless of their theme or plot.

**The results differ** because each metric operates on a completely different feature space. A film can have the same genres as "The Matrix Revolutions" (low Jaccard distance) but a very different budget (high Euclidean distance), or vice versa.

### Which distance would you use to recommend films? Why?

For a **recommender system**, the ideal would be a **weighted combination** of the three metrics, since each one captures complementary information. If only one must be chosen, however:

- The **Jaccard distance on genres** is the most practical and robust for general recommendations, because genres are the most intuitive criterion users apply when looking for similar films ("I want to watch another action and science-fiction film"). It is computationally cheap, does not depend on how the overview is worded, and captures the film's broad theme.

- A further improvement would be to combine Jaccard with the Euclidean distance on numeric attributes to refine recommendations by the film's production "profile", so that films of a similar scale are recommended.
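
One possible way to realize the weighted combination mentioned above is to rescale each distance to $[0, 1]$ and take a weighted sum. The sketch below does this on hypothetical distances for four candidate films; the helper `min_max`, the example values, and the weights are all assumptions for illustration and have not been tuned on the dataset.

```python
import numpy as np


def min_max(values):
    """Rescale an array to [0, 1]; a constant array maps to all zeros."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        return np.zeros_like(values)
    return (values - values.min()) / span


# Hypothetical distances from the query film to four candidate films.
d_jaccard = np.array([0.0, 0.25, 0.5, 1.0])
d_lev = np.array([110.0, 102.0, 130.0, 120.0])
d_euclid = np.array([0.3, 2.0, 0.2, 4.0])

# Illustrative weights (assumed, not tuned); they sum to 1.
w_j, w_l, w_e = 0.6, 0.1, 0.3

# Rescaling puts the three metrics on a comparable range before mixing them.
combined = w_j * min_max(d_jaccard) + w_l * min_max(d_lev) + w_e * min_max(d_euclid)

# Smaller combined distance means a stronger recommendation.
ranking = np.argsort(combined)
print(ranking)  # [0 1 2 3]
```

The rescaling step matters because the raw Levenshtein values (around 100) would otherwise swamp the Jaccard values, which never exceed 1.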
