### 1 Introduction and Setup
#### 1.1 Introduction

<span style="font-size:20px;"><b>Cinema Battle Cup — Introduction</b></span>

This section presents a small, playful competition designed to demonstrate concepts from the Big Data Foundations course while working with a large-scale, multi-source movie dataset. The goal was to build a “World Cup–style” knockout tournament between well-known cinema characters — heroes and villains — using real metrics enriched from two different data sources: **MovieLens** and **IMDB**.

Although the set of characters was manually selected by the user, all numerical attributes used in the competition were derived through data enrichment techniques applied to the movies in which these characters appear.

<span style="font-size:20px;"><b>Big Data Foundations Context: Data Preparation & Enrichment</b></span>

Before running the competition, several steps aligned with the course methodology were performed:

**1. Conversion of raw datasets to Parquet:**

All MovieLens and IMDB datasets were transformed into Parquet files, taking advantage of:
 - columnar storage
 - efficient compression
 - faster analytical queries
 - better integration with DuckDB

This is a core aspect of scalable data processing discussed in the course.

**2. Integration of MovieLens and IMDB via DuckDB:**

Using DuckDB, multiple views were created on top of the full IMDB dataset.
However, to avoid unnecessary storage and processing, only the IMDB movies that had a valid correspondence to the original MovieLens dataset (via imdbId → tconst) were materialized into new tables/views.

This ensured that the enriched dataset remained:
 - compact
 - relevant
 - faster to query
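
The `imdbId → tconst` correspondence can be sketched as follows. It assumes that `imdbId` is held as a plain integer (leading zeros lost on parsing) and relies on IMDb's `tconst` format of `tt` followed by at least seven digits. The helper name `imdb_id_to_tconst` is hypothetical, and the notebook's own join may be implemented differently.

```python
def imdb_id_to_tconst(imdb_id: int) -> str:
    """Format a numeric MovieLens imdbId as an IMDb tconst string."""
    # Shorter numbers are left-padded with zeros to seven digits;
    # longer ones are kept as they are.
    return f"tt{int(imdb_id):07d}"


print(imdb_id_to_tconst(114709))    # tt0114709
print(imdb_id_to_tconst(10872600))  # tt10872600
```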

**3. Data Enrichment:**

For each character, the system aggregated information from both sources:

 - `ml_ratings_count`: number of ratings on MovieLens
 - `imdb_ratings_count`: number of ratings (votes) on IMDB
 - `ml_average_rating`: average rating on MovieLens
 - `imdb_average_rating`: average rating on IMDB, halved so that it is on the same 5-point scale as MovieLens

These metrics allowed us to compare characters based on the popularity and quality of the movies they appear in — a direct application of multi-source data enrichment.

<span style="font-size:20px;"><b>Competition Structure</b></span>

The tournament is split into two groups:

 - `df_characters_heroes`: eight iconic heroes
 - `df_characters_vilains` (spelled this way in the code): eight famous villains

Each group follows a knockout format until one champion remains.
Finally, the two champions face each other in the grand final.

**First Round — From Top 8 to Top 4**

Randomised match-ups (using a fixed seed for reproducibility).
Winner = character with the highest:

> ml_ratings_count + imdb_ratings_count

This favours characters appearing in highly viewed or popular films.
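
As a worked example of this rule, the sketch below replays the James Bond vs. Frodo match using the aggregated counts from the heroes summary table in section 5.1.1. The dictionary keys follow the naming used in this introduction; in the notebook's code the corresponding columns are `character_rating_count` and `character_imdb_count`.

```python
# Aggregated counts copied from the heroes summary table (section 5.1.1)
james_bond = {"ml_ratings_count": 233408, "imdb_ratings_count": 6128971}
frodo = {"ml_ratings_count": 229139, "imdb_ratings_count": 6180094}


def total_ratings(character):
    """Round 1 score: MovieLens ratings plus IMDb votes."""
    return character["ml_ratings_count"] + character["imdb_ratings_count"]


bond_total = total_ratings(james_bond)   # 6362379
frodo_total = total_ratings(frodo)       # 6409233
round1_winner = "Frodo" if frodo_total > bond_total else "James Bond"
print(bond_total, frodo_total, round1_winner)
```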

**Second Round — From Top 4 to Top 2**

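Winner = character with the highest count-weighted average rating across both sources (this is what `weighted_global_rating` computes in section 5.2):

> (ml_average_rating × ml_ratings_count + imdb_average_rating × imdb_ratings_count) / (ml_ratings_count + imdb_ratings_count)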

This rewards characters associated with well-rated films.
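
In the notebook's code (`weighted_global_rating`, section 5.2), each source's average is weighted by its number of ratings. Because IMDb vote counts are far larger than MovieLens counts, the result sits close to the IMDb average. A small sketch using the Frodo and Batman rows from the summary table in section 5.1.1 (the function name `weighted_rating` is just for this illustration):

```python
def weighted_rating(ml_avg, ml_count, imdb_avg, imdb_count):
    """Average of the two source ratings, weighted by rating counts."""
    total = ml_count + imdb_count
    if total == 0:
        return 0.0
    return (ml_avg * ml_count + imdb_avg * imdb_count) / total


# Values copied from the heroes summary table (section 5.1.1)
frodo_score = weighted_rating(4.096864, 229139, 4.451672, 6180094)
batman_score = weighted_rating(3.288466, 184975, 3.611069, 5268974)
print(round(frodo_score, 3), round(batman_score, 3))
```

Both results agree with the `score_1` and `score_2` values reported for the Frodo vs. Batman match in section 5.4.1, up to rounding of the inputs.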

**Third Round — Final of Each Category (Top 2 → Winner)**

Winner = character with the strongest combined impact:

>(ml_average_rating × ml_ratings_count) + (imdb_average_rating × imdb_ratings_count)

This incorporates both rating volume and rating quality.
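
Unlike an average, this score is not normalised, so a character with many more ratings can win even with a similar average rating. The sketch below replays the heroes final (Ripley vs. Frodo) with the values from the summary table in section 5.1.1. The results match the scores in section 5.4.1 only approximately, because the table values are rounded.

```python
def product_score(ml_avg, ml_count, imdb_avg, imdb_count):
    """Sum of (average rating x number of ratings) over both sources."""
    return ml_avg * ml_count + imdb_avg * imdb_count


# Values copied from the heroes summary table (section 5.1.1)
ripley_score = product_score(3.781146, 119218, 3.955242, 2500248)
frodo_score = product_score(4.096864, 229139, 4.451672, 6180094)
final_winner = "Frodo" if frodo_score > ripley_score else "Ripley"
print(f"{ripley_score:.4g} {frodo_score:.4g} {final_winner}")
```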

**Final — Hero Champion vs. Villain Champion**

The winners of each category face off using the same metric as the third round.

<span style="font-size:20px;"><b>Reproducibility</b></span>

The random shuffling of match-ups uses a fixed seed, ensuring the entire competition can be reproduced exactly.
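
The idea can be illustrated with a standard-library analogue. The notebook itself shuffles with pandas (`DataFrame.sample` with `random_state`), so the pairings produced by this sketch will not match the notebook's. The point is only that the same seed always yields the same draw. The function name `draw_matchups` is hypothetical.

```python
import random


def draw_matchups(names, seed):
    """Shuffle names with a seeded generator and pair neighbours."""
    rng = random.Random(seed)   # private generator, independent of global state
    shuffled = list(names)
    rng.shuffle(shuffled)
    return [(shuffled[i], shuffled[i + 1]) for i in range(0, len(shuffled) - 1, 2)]


heroes = ["Batman", "Superman", "James Bond", "Harry Potter",
          "Frodo", "Ripley", "John McClane", "Neo"]

first_draw = draw_matchups(heroes, seed=42)
second_draw = draw_matchups(heroes, seed=42)
print(first_draw == second_draw)  # True
```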




#### 1.2 Library and DuckDB file import

In [275]:
#================================================
# DATA
#================================================

import duckdb, pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np

# creating the connection to the DuckDB database file:
con = duckdb.connect("movielensfull33M.duckdb")

In [276]:
IMDB_DIR = Path("..") / "data" / "Imdb"
IMDB_DIR

WindowsPath('../data/Imdb')

### 2 Functions to help search for well-known characters
#### 2.1 Ad-hoc query: characters and actor names by movieId

In [277]:
def search_by_movie_id(movie_id):
    query = """
        WITH ml_ratings AS (
            SELECT
                movieId,
                AVG(rating) AS ml_avg_rating,
                COUNT(*) AS ml_ratings_count
            FROM ratings
            GROUP BY movieId
        ),
        imdb_ratings AS (
            SELECT
                movieId,
                averageRating / 2 AS imdb_avg_rating,
                numVotes AS imdb_ratings_count
            FROM movielens_ratings_imdb
        )
        SELECT
            a.movieId,
            a.characters,
            a.actor_name,
            a.title AS movie_title,
            ROUND(mr.ml_avg_rating, 2) AS ml_avg_rating,
            mr.ml_ratings_count,
            ir.imdb_avg_rating,
            ir.imdb_ratings_count
        FROM movielens_actors AS a
        LEFT JOIN ml_ratings AS mr
            ON mr.movieId = a.movieId
        LEFT JOIN imdb_ratings AS ir
            ON ir.movieId = a.movieId
        WHERE a.movieId = ?
        ORDER BY movie_title, actor_name;
    """
    
    return con.execute(query, [movie_id]).df()



#### 2.2 Function to search movies by Characters

In [278]:
def search_by_character(character):
    query = """
        WITH ml_ratings AS (
            SELECT
                movieId,
                AVG(rating) AS ml_avg_rating,
                COUNT(*)    AS ml_ratings_count
            FROM ratings
            GROUP BY movieId
        ),
        imdb_ratings AS (
            SELECT
                movieId,
                averageRating/2 AS imdb_avg_rating,
                numVotes      AS imdb_ratings_count
            FROM movielens_ratings_imdb
        )
        SELECT
            a.movieId,
            a.characters,
            a.actor_name,
            a.title AS movie_title,
            ROUND(mr.ml_avg_rating, 2) AS ml_avg_rating,
            mr.ml_ratings_count,
            ir.imdb_avg_rating,
            ir.imdb_ratings_count
        FROM movielens_actors AS a
        LEFT JOIN ml_ratings AS mr
            ON mr.movieId = a.movieId
        LEFT JOIN imdb_ratings AS ir
            ON ir.movieId = a.movieId
        WHERE LOWER(a.characters) LIKE LOWER('%' || ? || '%')
        ORDER BY movie_title, actor_name;
    """
    return con.execute(query, [character]).df()
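
The `WHERE` clause above performs a case-insensitive substring match with a bound parameter. The sketch below reproduces just that clause, using the standard library's `sqlite3` as a stand-in for DuckDB (an assumption made so that the example runs without the notebook's database). The table is reduced to three columns, and its rows are copied from results shown later in this notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movielens_actors (characters TEXT, actor_name TEXT, title TEXT)"
)
conn.executemany(
    "INSERT INTO movielens_actors VALUES (?, ?, ?)",
    [
        ("James Bond", "Daniel Craig", "Casino Royale (2006)"),
        ("Sir James Bond", "David Niven", "Casino Royale (1967)"),
        ("Evelyn Tremble (James Bond - 007)", "Peter Sellers", "Casino Royale (1967)"),
        ("Ripley", "Sigourney Weaver", "Alien (1979)"),
    ],
)

# Same pattern as in search_by_character: lower-case both sides and
# wrap the bound parameter in % wildcards.
pattern_query = """
    SELECT characters, actor_name
    FROM movielens_actors
    WHERE LOWER(characters) LIKE LOWER('%' || ? || '%')
    ORDER BY actor_name
"""
matches = conn.execute(pattern_query, ["JAMES bond"]).fetchall()
for row in matches:
    print(row)
```

All three Bond-related rows match regardless of case, including the longer character strings. This is why some of the searches below add a title or actor filter to narrow the results.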



#### 2.3 Function to search movies by Characters and movie title

In [279]:
def search_by_character_and_movies(character, movie_title):
    query = """
        WITH ml_ratings AS (
            SELECT
                movieId,
                AVG(rating) AS ml_avg_rating,
                COUNT(*) AS ml_ratings_count
            FROM ratings
            GROUP BY movieId
        ),
        imdb_ratings AS (
            SELECT
                movieId,
                averageRating / 2 AS imdb_avg_rating,
                numVotes AS imdb_ratings_count
            FROM movielens_ratings_imdb
        )
        SELECT
            a.movieId,
            a.characters,
            a.actor_name,
            a.title AS movie_title,
            ROUND(mr.ml_avg_rating, 2) AS ml_avg_rating,
            mr.ml_ratings_count,
            ir.imdb_avg_rating,
            ir.imdb_ratings_count
        FROM movielens_actors AS a
        LEFT JOIN ml_ratings AS mr
            ON mr.movieId = a.movieId
        LEFT JOIN imdb_ratings AS ir
            ON ir.movieId = a.movieId
        WHERE LOWER(a.characters) LIKE LOWER('%' || ? || '%')
          AND LOWER(a.title) LIKE LOWER('%' || ? || '%')
        ORDER BY movie_title, actor_name;
    """
    
    return con.execute(query, [character, movie_title]).df()


#### 2.4 Function to search movies by Characters and actor name

In [280]:
def search_by_character_and_actor(character, actor_name):
    query = """
        WITH ml_ratings AS (
            SELECT
                movieId,
                AVG(rating) AS ml_avg_rating,
                COUNT(*) AS ml_ratings_count
            FROM ratings
            GROUP BY movieId
        ),
        imdb_ratings AS (
            SELECT
                movieId,
                averageRating / 2 AS imdb_avg_rating,
                numVotes AS imdb_ratings_count
            FROM movielens_ratings_imdb
        )
        SELECT
            a.movieId,
            a.characters,
            a.actor_name,
            a.title AS movie_title,
            ROUND(mr.ml_avg_rating, 2) AS ml_avg_rating,
            mr.ml_ratings_count,
            ir.imdb_avg_rating,
            ir.imdb_ratings_count
        FROM movielens_actors AS a
        LEFT JOIN ml_ratings AS mr
            ON mr.movieId = a.movieId
        LEFT JOIN imdb_ratings AS ir
            ON ir.movieId = a.movieId
        WHERE LOWER(a.characters) LIKE LOWER('%' || ? || '%')
          AND LOWER(a.actor_name) LIKE LOWER('%' || ? || '%')
        ORDER BY movie_title, actor_name;
    """
    
    return con.execute(query, [character, actor_name]).df()


#### 2.5 Function to summarize the result dataframes

In [281]:
import pandas as pd
import numpy as np

def summarize_character(df, character_name=None):
    """
    Summarize a dataframe (from your search_* functions)
    into a single-row dataframe with weighted averages
    for ML and IMDb ratings.
    """
    if df.empty:
        return pd.DataFrame([{
            "character": character_name,
            "movies_count": 0,
            "character_avg_rating": np.nan,
            "character_rating_count": 0,
            "character_imdb_rating": np.nan,
            "character_imdb_count": 0,
        }])
    
    # If character_name is not provided, try to infer it from the dataframe
    if character_name is None:
        # Take the first character string in the results
        character_name = df["characters"].iloc[0]
    
    # 1) number of distinct movies
    movies_count = df["movieId"].nunique()
    
    # 2) total ML ratings count
    character_rating_count = df["ml_ratings_count"].fillna(0).sum()
    
    # 3) weighted average ML rating
    #    sum(ml_avg_rating * ml_ratings_count) / sum(ml_ratings_count)
    ml_weights = df["ml_ratings_count"].fillna(0)
    ml_values = df["ml_avg_rating"]
    if (ml_weights > 0).any():
        character_avg_rating = (ml_values * ml_weights).sum() / ml_weights.sum()
    else:
        character_avg_rating = np.nan
    
    # 4) total IMDb ratings count
    character_imdb_count = df["imdb_ratings_count"].fillna(0).sum()
    
    # 5) weighted average IMDb rating
    imdb_weights = df["imdb_ratings_count"].fillna(0)
    imdb_values = df["imdb_avg_rating"]
    if (imdb_weights > 0).any():
        character_imdb_rating = (imdb_values * imdb_weights).sum() / imdb_weights.sum()
    else:
        character_imdb_rating = np.nan
    
    # Build single-row dataframe
    summary = pd.DataFrame([{
        "character": character_name,
        "movies_count": movies_count,
        "character_avg_rating": character_avg_rating,
        "character_rating_count": character_rating_count,
        "character_imdb_rating": character_imdb_rating,
        "character_imdb_count": character_imdb_count,
    }])
    
    return summary
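
As a worked example of the count-weighted average computed by `summarize_character`, the plain-Python sketch below uses the three Frodo rows shown in section 3.5 (`ml_avg_rating`, `ml_ratings_count`). It is an illustration of the arithmetic only, not a replacement for the function.

```python
# (ml_avg_rating, ml_ratings_count) pairs from the Frodo search result (section 3.5)
frodo_rows = [
    (4.10, 79940),   # The Fellowship of the Ring
    (4.11, 75512),   # The Return of the King
    (4.08, 73687),   # The Two Towers
]

total_count = sum(count for _, count in frodo_rows)
# Each movie's average contributes in proportion to its number of ratings
weighted_avg = sum(avg * count for avg, count in frodo_rows) / total_count
# Unweighted mean, shown only for comparison
simple_avg = sum(avg for avg, _ in frodo_rows) / len(frodo_rows)

print(total_count, round(weighted_avg, 4), round(simple_avg, 4))
```

The total count and the weighted average agree with the Frodo row of the heroes summary table (229139 and roughly 4.0969).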


### 3 Biggest Heroes Ever
#### 3.1 Batman

In [282]:
df=search_by_character_and_movies("batman", "batman")
df

Unnamed: 0,movieId,characters,actor_name,movie_title,ml_avg_rating,ml_ratings_count,imdb_avg_rating,imdb_ratings_count
0,176681,Batman,Kevin Conroy,Batman & Harley Quinn (2017),2.84,119,2.95,16283
1,1562,Batman,George Clooney,Batman & Robin (1997),2.19,12649,1.9,279184
2,91054,Batman,Lewis Wilson,Batman (1943),2.95,31,3.0,2690
3,26152,Batman,Adam West,Batman (1966),3.21,1092,3.25,37499
4,592,Batman,Michael Keaton,Batman (1989),3.39,56330,3.75,426194
5,33794,Batman,Christian Bale,Batman Begins (2005),3.92,43300,4.1,1679844
6,167762,Batman,Will Friedle,Batman Beyond Darwyn Cooke's Batman 75th Anniv...,3.35,52,3.9,2667
7,174957,Batman,Kevin Conroy,Batman Beyond: The Movie (1999),3.51,59,3.85,6901
8,174957,Batman,Will Friedle,Batman Beyond: The Movie (1999),3.51,59,3.85,6901
9,153,Batman,Val Kilmer,Batman Forever (1995),2.89,40052,2.75,278849


In [283]:
df_characters_heroes = summarize_character(df)
df_characters_heroes

Unnamed: 0,character,movies_count,character_avg_rating,character_rating_count,character_imdb_rating,character_imdb_count
0,Batman,32,3.288466,184975,3.611069,5268974


#### 3.2 Superman

In [284]:
df=search_by_character_and_movies("superman", "superman")
df

Unnamed: 0,movieId,characters,actor_name,movie_title,ml_avg_rating,ml_ratings_count,imdb_avg_rating,imdb_ratings_count
0,157631,Superman,Hwang Jung-min,A Man Who Was Superman (2008),3.47,19,3.6,2350
1,157631,Superman - child,Woo-hyuk Choi,A Man Who Was Superman (2008),3.47,19,3.6,2350
2,140415,Superman,Kirk Alyn,Atom Man vs Superman (1950),2.6,5,3.3,923
3,136864,Superman,Henry Cavill,Batman v Superman: Dawn of Justice (2016),2.7,4625,3.2,782923
4,140439,Superman,David Patrick Wilson,"It's A Bird, It's A Plane, It's Superman! (1975)",1.83,3,1.9,482
5,219488,Superman,George Reeves,Stamp Day for Superman (1954),4.0,3,2.7,526
6,140417,Superman,Kirk Alyn,Superman (1948),2.78,9,3.35,1359
7,2640,Superman,Christopher Reeve,Superman (1978),3.38,18453,3.7,204761
8,217461,Superman,Tim Daly,Superman - The Last Son of Krypton (1996),3.83,6,3.8,3401
9,2641,Superman,Christopher Reeve,Superman II (1980),3.1,10622,3.4,123048


In [285]:
summary_df = summarize_character(df)
df_characters_heroes = pd.concat([df_characters_heroes, summary_df], ignore_index=True)

#### 3.3 James Bond

In [286]:
df= search_by_character("james bond")
df

Unnamed: 0,movieId,characters,actor_name,movie_title,ml_avg_rating,ml_ratings_count,imdb_avg_rating,imdb_ratings_count
0,258113,James Bond,Bullet Prakash,Bajarangi (2013),3.5,1,3.0,496
1,262133,James Bond,Barry Nelson,Casino Royale (1954),2.5,1,2.8,1658
2,5796,Sir James Bond,David Niven,Casino Royale (1967),2.88,1120,2.5,34312
3,5796,Evelyn Tremble (James Bond - 007),Peter Sellers,Casino Royale (1967),2.88,1120,2.5,34312
4,49272,James Bond,Daniel Craig,Casino Royale (2006),3.84,28517,4.0,729832
5,3984,James Bond,Sean Connery,Diamonds Are Forever (1971),3.5,5992,3.25,120225
6,5872,James Bond,Pierce Brosnan,Die Another Day (2002),3.09,8720,3.05,237203
7,2949,James Bond,Sean Connery,Dr. No (1962),3.67,9694,3.6,189896
8,2989,James Bond,Roger Moore,For Your Eyes Only (1981),3.44,5212,3.35,113752
9,2948,James Bond,Sean Connery,From Russia with Love (1963),3.69,9586,3.65,154124


In [287]:
summary_df = summarize_character(df)
df_characters_heroes = pd.concat([df_characters_heroes, summary_df], ignore_index=True)

#### 3.4 Harry Potter

In [288]:
df=search_by_character_and_movies("harry potter", "harry potter")
df

Unnamed: 0,movieId,characters,actor_name,movie_title,ml_avg_rating,ml_ratings_count,imdb_avg_rating,imdb_ratings_count
0,5816,Harry Potter,Daniel Radcliffe,Harry Potter and the Chamber of Secrets (2002),3.65,31004,3.75,746380
1,81834,Harry Potter,Daniel Radcliffe,Harry Potter and the Deathly Hallows: Part 1 (...,3.84,21781,3.85,646233
2,88125,Harry Potter,Daniel Radcliffe,Harry Potter and the Deathly Hallows: Part 2 (...,3.9,20837,4.05,1019944
3,40815,Harry Potter,Daniel Radcliffe,Harry Potter and the Goblet of Fire (2005),3.77,27128,3.85,733319
4,69844,Harry Potter,Daniel Radcliffe,Harry Potter and the Half-Blood Prince (2009),3.83,21849,3.8,642414
5,54001,Harry Potter,Daniel Radcliffe,Harry Potter and the Order of the Phoenix (2007),3.76,21900,3.75,681723
6,8368,Harry Potter,Daniel Radcliffe,Harry Potter and the Prisoner of Azkaban (2004),3.82,32517,3.95,746697
7,4896,Harry Potter,Daniel Radcliffe,Harry Potter and the Sorcerer's Stone (a.k.a. ...,3.7,36127,3.85,924808
8,4896,Baby Harry Potter,Saunders Triplets,Harry Potter and the Sorcerer's Stone (a.k.a. ...,3.7,36127,3.85,924808


In [289]:
summary_df = summarize_character(df)
df_characters_heroes = pd.concat([df_characters_heroes, summary_df], ignore_index=True)
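
The search returns one row per matching cast entry, so a movie appears twice when two credited roles match (movieId 4896 above, for "Harry Potter" and "Baby Harry Potter"). Because `summarize_character` sums `ml_ratings_count` over rows, that movie's ratings enter the total twice. The sketch below is not part of the notebook. Assuming each movie should be counted only once, it compares the two totals in plain Python.

```python
# (movieId, ml_ratings_count) pairs copied from the Harry Potter search result
rows = [
    (5816, 31004), (81834, 21781), (88125, 20837), (40815, 27128),
    (69844, 21849), (54001, 21900), (8368, 32517),
    (4896, 36127),  # Harry Potter
    (4896, 36127),  # Baby Harry Potter, same movie
]

sum_all_rows = sum(count for _, count in rows)

# A dict keyed by movieId keeps a single count per movie
per_movie = dict(rows)
sum_unique_movies = sum(per_movie.values())

print(sum_all_rows, len(per_movie), sum_unique_movies)
```

The first total (249270) is the one reported in the summary table. With pandas, one possible equivalent of the second total would be to call `df.drop_duplicates(subset="movieId")` before summarizing.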

#### 3.5 Frodo from the Lord of the Rings

In [290]:
df= search_by_character_and_actor("Frodo","Elijah Wood")
df

Unnamed: 0,movieId,characters,actor_name,movie_title,ml_avg_rating,ml_ratings_count,imdb_avg_rating,imdb_ratings_count
0,4993,Frodo,Elijah Wood,"Lord of the Rings: The Fellowship of the Ring,...",4.1,79940,4.45,2152498
1,7153,Frodo,Elijah Wood,"Lord of the Rings: The Return of the King, The...",4.11,75512,4.5,2117156
2,5952,Frodo,Elijah Wood,"Lord of the Rings: The Two Towers, The (2002)",4.08,73687,4.4,1910440


In [291]:
summary_df = summarize_character(df)
df_characters_heroes = pd.concat([df_characters_heroes, summary_df], ignore_index=True)

#### 3.6 Ellen Ripley from Alien

In [292]:
df= search_by_character_and_actor("Ripley","Sigourney Weaver")
df

Unnamed: 0,movieId,characters,actor_name,movie_title,ml_avg_rating,ml_ratings_count,imdb_avg_rating,imdb_ratings_count
0,1214,Ripley,Sigourney Weaver,Alien (1979),4.07,46572,4.25,1040401
1,1690,Ripley,Sigourney Weaver,Alien: Resurrection (1997),3.04,14811,3.1,287853
2,1200,Ripley,Sigourney Weaver,Aliens (1986),4.01,40182,4.2,824656
3,1320,Ripley,Sigourney Weaver,Alien³ (a.k.a. Alien 3) (1992),3.12,17653,3.2,347338


In [293]:
summary_df = summarize_character(df)
df_characters_heroes = pd.concat([df_characters_heroes, summary_df], ignore_index=True)

#### 3.7 John McClane from Die Hard

In [294]:
df= search_by_character("John McClane")
df

Unnamed: 0,movieId,characters,actor_name,movie_title,ml_avg_rating,ml_ratings_count,imdb_avg_rating,imdb_ratings_count
0,80183,John McClane,Matthew Géczy,8th Wonderland (2008),3.19,8,3.05,522
1,1036,John McClane,Bruce Willis,Die Hard (1988),3.94,47472,4.1,1005560
2,1370,John McClane,Bruce Willis,Die Hard 2 (1990),3.46,20122,3.6,400971
3,165,John McClane,Bruce Willis,Die Hard: With a Vengeance (1995),3.52,43336,3.8,425108
4,100498,John McClane,Bruce Willis,"Good Day to Die Hard, A (2013)",2.55,1571,2.6,221078
5,53972,John McClane,Bruce Willis,Live Free or Die Hard (2007),3.43,8784,3.55,432223


In [295]:
summary_df = summarize_character(df)
df_characters_heroes = pd.concat([df_characters_heroes, summary_df], ignore_index=True)

#### 3.8 Neo from Matrix

In [296]:
df= search_by_character_and_actor("Neo","Keanu Reeves")
df

Unnamed: 0,movieId,characters,actor_name,movie_title,ml_avg_rating,ml_ratings_count,imdb_avg_rating,imdb_ratings_count
0,6365,Neo,Keanu Reeves,"Matrix Reloaded, The (2003)",3.38,30788,3.6,660874
1,6934,Neo,Keanu Reeves,"Matrix Revolutions, The (2003)",3.24,24470,3.35,568043
2,2571,Neo,Keanu Reeves,"Matrix, The (1999)",4.16,107056,4.35,2198642


In [297]:
summary_df = summarize_character(df)
df_characters_heroes = pd.concat([df_characters_heroes, summary_df], ignore_index=True)

### 4 Biggest Villains Ever
#### 4.1 Darth Vader

In [298]:
df= search_by_character_and_movies("darth vader","Star Wars")
df

Unnamed: 0,movieId,characters,actor_name,movie_title,ml_avg_rating,ml_ratings_count,imdb_avg_rating,imdb_ratings_count
0,261153,Darth Vader,Matt Sloan,LEGO Star Wars: The Empire Strikes Out (2012),1.13,8,3.55,1736
1,261155,Darth Vader,Phil LaMarr,LEGO Star Wars: The Padawan Menace (2011),2.31,8,3.5,2276
2,136485,Darth Vader,Abraham Benrubi,Robot Chicken: Star Wars (2007),3.48,290,4.0,8864
3,181355,Darth Vader,Abraham Benrubi,Robot Chicken: Star Wars Episode II (2008),3.37,68,4.0,5211
4,181357,Darth Vader,Abraham Benrubi,Robot Chicken: Star Wars Episode III (2010),3.6,97,4.0,4921
5,260,Darth Vader,David Prowse,Star Wars: Episode IV - A New Hope (1977),4.09,97202,4.3,1538496
6,1196,Darth Vader,David Prowse,Star Wars: Episode V - The Empire Strikes Back...,4.12,80200,4.35,1472190
7,1210,Darth Vader,James Earl Jones,Star Wars: Episode VI - Return of the Jedi (1983),3.98,76773,4.15,1186668
8,229523,Darth Vader,Jack Foley,Star Wars: Revelations,2.0,4,2.5,1207
9,229523,Darth Vader,Kevin Zabawa,Star Wars: Revelations,2.0,4,2.5,1207


In [299]:
df_characters_vilains = summarize_character(df)


#### 4.2 Hannibal Lecter

In [300]:
df= search_by_character_and_actor("lecter","Anthony Hopkins")
df

Unnamed: 0,movieId,characters,actor_name,movie_title,ml_avg_rating,ml_ratings_count,imdb_avg_rating,imdb_ratings_count
0,4148,Hannibal Lecter,Anthony Hopkins,Hannibal (2001),3.24,12067,3.4,309421
1,5630,Hannibal Lecter,Anthony Hopkins,Red Dragon (2002),3.56,9272,3.6,307464
2,593,Dr. Hannibal Lecter,Anthony Hopkins,"Silence of the Lambs, The (1991)",4.15,101802,4.3,1675121


In [301]:
summary_df = summarize_character(df)
df_characters_vilains = pd.concat([df_characters_vilains, summary_df], ignore_index=True)

#### 4.3 Joker

In [302]:
df1= search_by_character_and_movies("joker","batman")
df2= search_by_character_and_actor("joker","Heath Ledger")

df = pd.concat([df1, df2], ignore_index=True)
df

Unnamed: 0,movieId,characters,actor_name,movie_title,ml_avg_rating,ml_ratings_count,imdb_avg_rating,imdb_ratings_count
0,26152,The Joker,Cesar Romero,Batman (1966),3.21,1092,3.25,37499
1,592,Joker,Jack Nicholson,Batman (1989),3.39,56330,3.75,426194
2,186985,The Joker,Wataru Takagi,Batman Ninja (2018),2.96,199,2.8,23774
3,140115,The Joker,Troy Baker,Batman Unlimited: Monster Mayhem (2015),2.09,40,2.8,3347
4,202099,Joker,Troy Baker,Batman vs. Teenage Mutant Ninja Turtles (2019),3.38,85,3.55,13701
5,178997,Joker,Jeff Bergman,Batman vs. Two-Face (2017),2.66,53,3.1,4929
6,182613,The Joker,Andrew Koenig,Batman: Dead End (2003),3.09,23,3.6,6367
7,165085,The Joker,Jeff Bergman,Batman: Return of the Caped Crusaders (2016),2.88,76,3.35,6869
8,161354,The Joker,Mark Hamill,Batman: The Killing Joke (2016),3.0,524,3.2,64721
9,165153,Joker,John DiMaggio,LEGO DC Comics Super Heroes: Batman: Be-League...,2.63,15,3.2,1777


In [303]:
summary_df = summarize_character(df)
df_characters_vilains = pd.concat([df_characters_vilains, summary_df], ignore_index=True)

#### 4.4 Norman Bates

In [304]:
df= search_by_character("norman bates")
df

Unnamed: 0,movieId,characters,actor_name,movie_title,ml_avg_rating,ml_ratings_count,imdb_avg_rating,imdb_ratings_count
0,202171,Norman Bates,Kurt Paul,Bates Motel (1987),2.28,9,1.9,2007.0
1,1219,Norman Bates,Anthony Perkins,Psycho (1960),4.06,28016,4.25,769531.0
2,2389,Norman Bates,Vince Vaughn,Psycho (1998),2.81,3704,2.3,52328.0
3,2902,Norman Bates,Anthony Perkins,Psycho II (1983),2.59,1342,3.3,34265.0
4,2903,Norman Bates,Anthony Perkins,Psycho III (1986),2.13,858,2.75,17997.0
5,184071,Norman Bates,Anthony Perkins,Psycho IV: The Beginning (1990),2.52,29,2.7,10723.0
6,211966,madre de Norman Bates,Silvia Gambino,WHAT DID JACK DO? (2017),3.28,130,,
7,161014,Norman Bates,Scott McGinnis,Wacko (1982),2.61,9,2.45,1600.0


In [305]:
summary_df = summarize_character(df)
df_characters_vilains = pd.concat([df_characters_vilains, summary_df], ignore_index=True)

#### 4.5 Chucky

In [306]:
df= search_by_character_and_actor("Chucky","Brad Dourif")
df

Unnamed: 0,movieId,characters,actor_name,movie_title,ml_avg_rating,ml_ratings_count,imdb_avg_rating,imdb_ratings_count
0,2315,Chucky,Brad Dourif,Bride of Chucky (Child's Play 4) (1998),2.21,2359,2.8,69094
1,1991,Chucky,Brad Dourif,Child's Play (1988),2.87,3464,3.35,128788
2,1992,Chucky,Brad Dourif,Child's Play 2 (1990),2.37,1562,3.0,63013
3,1993,Chucky,Brad Dourif,Child's Play 3 (1991),2.11,1249,2.6,50271
4,178447,Chucky,Brad Dourif,Cult of Chucky (2017),2.57,157,2.65,33243
5,8967,Chucky,Brad Dourif,Seed of Chucky (Child's Play 5) (2004),2.24,626,2.45,53532


In [307]:
summary_df = summarize_character(df)
df_characters_vilains = pd.concat([df_characters_vilains, summary_df], ignore_index=True)

#### 4.6 Cruella de Vil

In [308]:
df= search_by_character("Cruella")
df

Unnamed: 0,movieId,characters,actor_name,movie_title,ml_avg_rating,ml_ratings_count,imdb_avg_rating,imdb_ratings_count
0,1367,Cruella DeVil,Glenn Close,101 Dalmatians (1996),3.05,11168,2.9,122696
1,2085,Cruella De Vil,Betty Lou Gerson,101 Dalmatians (One Hundred and One Dalmatians...,3.43,10747,3.65,192916
2,121099,Cruella,Susanne Blakeslee,101 Dalmatians II: Patch's London Adventure (2...,2.87,129,2.85,11760
3,3991,Cruella de Vil,Glenn Close,102 Dalmatians (2000),2.38,2288,2.45,41319
4,249540,Cruella,Emma Stone,Cruella (2021),3.5,990,3.65,288493
5,174535,Cruella De Vil,Susanne Blakeslee,Mickey's House of Villains (2001),2.64,33,3.3,4526


In [309]:
summary_df = summarize_character(df)
df_characters_vilains = pd.concat([df_characters_vilains, summary_df], ignore_index=True)

#### 4.7 Michael Corleone from the Godfather

In [310]:
df= search_by_character_and_actor("michael","pacino")
df

Unnamed: 0,movieId,characters,actor_name,movie_title,ml_avg_rating,ml_ratings_count,imdb_avg_rating,imdb_ratings_count
0,858,Michael,Al Pacino,"Godfather, The (1972)",4.33,75004,4.6,2173294
1,1221,Michael,Al Pacino,"Godfather: Part II, The (1974)",4.27,47271,4.5,1460084
2,2023,Michael Corleone,Al Pacino,"Godfather: Part III, The (1990)",3.45,14446,3.8,446783


In [311]:
summary_df = summarize_character(df,"Michael Corleone")
df_characters_vilains = pd.concat([df_characters_vilains, summary_df], ignore_index=True)

#### 4.8 Loki from the Thor series

In [312]:
df= search_by_character_and_actor("loki","Tom Hiddleston")
df

Unnamed: 0,movieId,characters,actor_name,movie_title,ml_avg_rating,ml_ratings_count,imdb_avg_rating,imdb_ratings_count
0,89745,Loki,Tom Hiddleston,"Avengers, The (2012)",3.74,27495,4.0,1532467
1,86332,Loki,Tom Hiddleston,Thor (2011),3.32,14900,3.5,949663
2,122916,Loki,Tom Hiddleston,Thor: Ragnarok (2017),3.9,14231,3.95,875233
3,106072,Loki,Tom Hiddleston,Thor: The Dark World (2013),3.19,8374,3.35,764561


In [313]:
summary_df = summarize_character(df)
df_characters_vilains = pd.concat([df_characters_vilains, summary_df], ignore_index=True)

### 5 Competitions
#### 5.1 Summary of characters in the competition
##### 5.1.1 Heroes

In [314]:
df_characters_heroes

Unnamed: 0,character,movies_count,character_avg_rating,character_rating_count,character_imdb_rating,character_imdb_count
0,Batman,32,3.288466,184975,3.611069,5268974
1,Superman,17,2.980822,50456,3.177807,1601806
2,James Bond,31,3.485162,233408,3.520098,6128971
3,Harry Potter,8,3.762671,249270,3.864679,7066326
4,Frodo,3,4.096864,229139,4.451672,6180094
5,Ripley,4,3.781146,119218,3.955242,2500248
6,John McClane,6,3.655324,121293,3.738737,2485462
7,Neo,3,3.873352,162314,4.039663,3427559


##### 5.1.2 Villains

In [315]:
df_characters_vilains

Unnamed: 0,character,movies_count,character_avg_rating,character_rating_count,character_imdb_rating,character_imdb_count
0,Darth Vader,10,4.062404,254902,4.258658,4240828
1,Hannibal Lecter,3,4.016402,123141,4.084598,2292006
2,The Joker,14,3.802336,126184,4.356721,3885432
3,Norman Bates,8,3.812652,34097,4.040868,888451
4,Chucky,6,2.47405,9417,2.92479,397941
5,Cruella DeVil,6,3.166728,25355,3.41939,661710
6,Michael Corleone,3,4.216274,136721,4.476614,4080161
7,Loki,4,3.607896,65000,3.75362,4121924


#### 5.2 Auxiliary functions to handle the "competition"

In [316]:
# ---- Helper functions for scores ----

def total_ratings(row):
    """Total number of ratings (MovieLens + IMDb)."""
    return row["character_rating_count"] + row["character_imdb_count"]

def weighted_global_rating(row):
    """Weighted average of MovieLens and IMDb ratings."""
    ml_count = row["character_rating_count"]
    imdb_count = row["character_imdb_count"]
    ml_rating = row["character_avg_rating"]
    imdb_rating = row["character_imdb_rating"]
    
    total_count = ml_count + imdb_count
    if total_count == 0:
        return 0.0
    return (ml_rating * ml_count + imdb_rating * imdb_count) / total_count

def product_score(row):
    """Score based on rating * count for both MovieLens and IMDb."""
    ml_part = row["character_avg_rating"] * row["character_rating_count"]
    imdb_part = row["character_imdb_rating"] * row["character_imdb_count"]
    return ml_part + imdb_part


# ---- Function to play one knockout round ----

def play_round(df, score_func, round_name="Round", random_state=42, side_label=""):
    """
    Play a knockout round:
        - Shuffle contestants
        - Pair them 1 vs 1
        - Winner decided by score_func(row)
    Returns:
        winners_df, matches (list of dicts with match results)
    """
    # Shuffle to randomize matchups
    shuffled = df.sample(frac=1, random_state=random_state).reset_index(drop=True)
    
    winners = []
    matches = []

    # We assume an even number of rows (8, 4, 2, etc.)
    for i in range(0, len(shuffled), 2):
        p1 = shuffled.iloc[i]
        p2 = shuffled.iloc[i + 1]
        
        s1 = score_func(p1)
        s2 = score_func(p2)
        
        # Decide winner (tie goes to p1)
        winner = p1 if s1 >= s2 else p2
        
        matches.append({
            "round": round_name,
            "side": side_label,
            "character_1": p1["character"],
            "score_1": s1,
            "character_2": p2["character"],
            "score_2": s2,
            "winner": winner["character"],
        })
        
        winners.append(winner)

    winners_df = pd.DataFrame(winners).reset_index(drop=True)
    return winners_df, matches


# ---- Main tournament function ----

def run_tournament(df_characters_heroes, df_characters_vilains, base_seed=42):
    """
    Run the heroes vs villains tournament.
    
    Input:
        df_characters_heroes  - dataframe with 8 heroes
        df_characters_vilains - dataframe with 8 villains
        
    Returns:
        results: dict with
            - "matches": list of all match dicts
            - "hero_champion": row (Series) of hero champion
            - "villain_champion": row (Series) of villain champion
            - "grand_final": dict with final match info
    """
    all_matches = []
    
    # ---------- HEROES BRACKET ----------
    # Round 1: 8 -> 4 (by total ratings)
    heroes_r1_winners, matches = play_round(
        df_characters_heroes,
        score_func=total_ratings,
        round_name="Heroes R1 (total ratings)",
        random_state=base_seed,
        side_label="heroes"
    )
    all_matches.extend(matches)
    
    # Round 2: 4 -> 2 (by weighted global rating)
    heroes_r2_winners, matches = play_round(
        heroes_r1_winners,
        score_func=weighted_global_rating,
        round_name="Heroes R2 (weighted rating)",
        random_state=base_seed + 1,
        side_label="heroes"
    )
    all_matches.extend(matches)
    
    # Round 3: 2 -> 1 champion (by product score)
    heroes_champion_df, matches = play_round(
        heroes_r2_winners,
        score_func=product_score,
        round_name="Heroes Final (product score)",
        random_state=base_seed + 2,
        side_label="heroes"
    )
    all_matches.extend(matches)
    hero_champion = heroes_champion_df.iloc[0]
    
    # ---------- VILLAINS BRACKET ----------
    villains_r1_winners, matches = play_round(
        df_characters_vilains,
        score_func=total_ratings,
        round_name="Villains R1 (total ratings)",
        random_state=base_seed,
        side_label="villains"
    )
    all_matches.extend(matches)
    
    villains_r2_winners, matches = play_round(
        villains_r1_winners,
        score_func=weighted_global_rating,
        round_name="Villains R2 (weighted rating)",
        random_state=base_seed + 1,
        side_label="villains"
    )
    all_matches.extend(matches)
    
    villains_champion_df, matches = play_round(
        villains_r2_winners,
        score_func=product_score,
        round_name="Villains Final (product score)",
        random_state=base_seed + 2,
        side_label="villains"
    )
    all_matches.extend(matches)
    villain_champion = villains_champion_df.iloc[0]
    
    # ---------- GRAND FINAL ----------
    # Hero champion vs Villain champion using product_score again
    grand_final_contestants = pd.DataFrame([hero_champion, villain_champion]).reset_index(drop=True)
    gf_winners_df, gf_matches = play_round(
        grand_final_contestants,
        score_func=product_score,
        round_name="Grand Final (Hero vs Villain)",
        random_state=base_seed + 3,
        side_label="grand_final"
    )
    all_matches.extend(gf_matches)
    
    grand_final = gf_matches[0]  # only one match
    grand_champion = gf_winners_df.iloc[0]
    
    results = {
        "matches": all_matches,
        "hero_champion": hero_champion,
        "villain_champion": villain_champion,
        "grand_final": grand_final,
        "grand_champion": grand_champion,
    }
    
    return results


### 5.3 Fight!

The competition is launched by calling the `run_tournament` function.

In [317]:
results = run_tournament(df_characters_heroes, df_characters_vilains, base_seed=42)

### 5.4 Competition results

#### 5.4.1 All rounds

In [318]:
# All matches as a DataFrame
matches_df = pd.DataFrame(results["matches"])
matches_df

Unnamed: 0,round,side,character_1,score_1,character_2,score_2,winner
0,Heroes R1 (total ratings),heroes,Superman,1652262.0,Ripley,2619466.0,Ripley
1,Heroes R1 (total ratings),heroes,Batman,5453949.0,Neo,3589873.0,Batman
2,Heroes R1 (total ratings),heroes,James Bond,6362379.0,Frodo,6409233.0,Frodo
3,Heroes R1 (total ratings),heroes,Harry Potter,7315596.0,John McClane,2606755.0,Harry Potter
4,Heroes R2 (weighted rating),heroes,Frodo,4.438988,Batman,3.600127,Frodo
5,Heroes R2 (weighted rating),heroes,Harry Potter,3.861203,Ripley,3.947318,Ripley
6,Heroes Final (product score),heroes,Ripley,10339870.0,Frodo,28450510.0,Frodo
7,Villains R1 (total ratings),villains,Hannibal Lecter,2415147.0,Cruella DeVil,687065.0,Hannibal Lecter
8,Villains R1 (total ratings),villains,Darth Vader,4495730.0,Loki,4186924.0,Darth Vader
9,Villains R1 (total ratings),villains,The Joker,4011616.0,Chucky,407358.0,The Joker
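
The listing above shows only the first ten matches, so the later villain rounds and the grand final are not visible. A minimal sketch of how a single bracket or round could be pulled out of the matches table is given below. The `sample_matches` frame is a hypothetical stand-in built from three rows copied from the output above; in the notebook the same filters would be applied to `matches_df`.

```python
import pandas as pd

# Hypothetical stand-in for matches_df: three rows copied from the output above.
sample_matches = pd.DataFrame([
    {"round": "Heroes R1 (total ratings)", "side": "heroes",
     "character_1": "Superman", "score_1": 1652262.0,
     "character_2": "Ripley", "score_2": 2619466.0, "winner": "Ripley"},
    {"round": "Heroes Final (product score)", "side": "heroes",
     "character_1": "Ripley", "score_1": 10339870.0,
     "character_2": "Frodo", "score_2": 28450510.0, "winner": "Frodo"},
    {"round": "Villains R1 (total ratings)", "side": "villains",
     "character_1": "Darth Vader", "score_1": 4495730.0,
     "character_2": "Loki", "score_2": 4186924.0, "winner": "Darth Vader"},
])

# Keep only one side of the bracket.
villains_only = sample_matches[sample_matches["side"] == "villains"]

# Keep only rounds whose name contains "Final".
finals = sample_matches[sample_matches["round"].str.contains("Final")]

print(villains_only[["round", "character_1", "character_2", "winner"]])
print(finals[["round", "character_1", "character_2", "winner"]])
```

Filtering on the `side` and `round` columns works because `run_tournament` stores both labels on every match record.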


#### 5.4.2 Results of the hero champion

In [319]:
# Hero champion
results["hero_champion"]

character                    Frodo
movies_count                     3
character_avg_rating      4.096864
character_rating_count      229139
character_imdb_rating     4.451672
character_imdb_count       6180094
Name: 0, dtype: object

#### 5.4.3 Results of the villain champion

In [320]:
# Villain champion
results["villain_champion"]

character                 Michael Corleone
movies_count                             3
character_avg_rating              4.216274
character_rating_count              136721
character_imdb_rating             4.476614
character_imdb_count               4080161
Name: 0, dtype: object

#### 5.4.4 Grand Finale

In [321]:
# Grand final (details of the final match)
results["grand_final"]

{'round': 'Grand Final (Hero vs Villain)',
 'side': 'grand_final',
 'character_1': 'Frodo',
 'score_1': np.float64(28450505.380000006),
 'character_2': 'Michael Corleone',
 'score_2': np.float64(18841758.99),
 'winner': 'Frodo'}
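
As a sanity check, the grand final scores can be approximately reproduced from the champion rows printed in 5.4.2 and 5.4.3. The sketch below assumes that the total-ratings score is the sum of the MovieLens and IMDB rating counts, that the product score is the sum of (average rating × rating count) over both sources, and that the weighted rating is their ratio. These definitions are inferred from the printed numbers, not copied from the scoring functions, and the `approx_*` helper names are hypothetical.

```python
# Champion rows as printed in 5.4.2 and 5.4.3 (averages are rounded for display).
frodo = {"ml_avg": 4.096864, "ml_count": 229139,
         "imdb_avg": 4.451672, "imdb_count": 6180094}
corleone = {"ml_avg": 4.216274, "ml_count": 136721,
            "imdb_avg": 4.476614, "imdb_count": 4080161}

def approx_total_ratings(c):
    # Assumed definition: rating counts from both sources added together.
    return c["ml_count"] + c["imdb_count"]

def approx_product_score(c):
    # Assumed definition: average * count per source, summed over both sources.
    return c["ml_avg"] * c["ml_count"] + c["imdb_avg"] * c["imdb_count"]

def approx_weighted_rating(c):
    # Assumed definition: count-weighted mean of the two average ratings.
    return approx_product_score(c) / approx_total_ratings(c)

for name, row in [("Frodo", frodo), ("Michael Corleone", corleone)]:
    print(name,
          approx_total_ratings(row),
          round(approx_product_score(row), 2),
          round(approx_weighted_rating(row), 6))
```

Under these assumptions Frodo's total is 6409233, which matches his Heroes R1 score, and both product scores land within a few units of the values in the grand final record. The small gap is expected because the displayed averages are rounded to six decimals.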

#### 5.4.5 Final result

In [322]:
# Overall grand champion
results["grand_champion"]

character                    Frodo
movies_count                     3
character_avg_rating      4.096864
character_rating_count      229139
character_imdb_rating     4.451672
character_imdb_count       6180094
Name: 0, dtype: object

### 6 Other interesting queries
#### 6.1 Best ranked writers (with a minimum of 100,000 ratings)

In [323]:
query_top50_writers = """
WITH joined AS (
    SELECT
        w.writer_name,
        r.movieId,
        r.rating
    FROM ratings r
    JOIN movielens_writers w
        ON w.movieId = r.movieId
    WHERE w.writer_name IS NOT NULL
),
writer_stats AS (
    SELECT
        writer_name,
        COUNT(*) AS total_ratings,
        COUNT(DISTINCT movieId) AS total_movies,
        ROUND(AVG(rating), 2) AS avg_rating
    FROM joined
    GROUP BY writer_name
    HAVING COUNT(*) >= 100000
),
writer_movie_ratings AS (
    SELECT
        writer_name,
        movieId,
        COUNT(*) AS movie_ratings
    FROM joined
    GROUP BY writer_name, movieId
),
best_movie_per_writer AS (
    SELECT
        writer_name,
        movieId,
        movie_ratings,
        ROW_NUMBER() OVER (
            PARTITION BY writer_name
            ORDER BY movie_ratings DESC, movieId
        ) AS rn
    FROM writer_movie_ratings
)
SELECT
    ws.writer_name,
    ws.total_ratings,
    ws.total_movies,
    ws.avg_rating,
    m.movieId          AS top_movie_id,
    m.title            AS top_movie_title,
    bm.movie_ratings   AS top_movie_ratings
FROM writer_stats ws
JOIN best_movie_per_writer bm
    ON ws.writer_name = bm.writer_name
   AND bm.rn = 1
JOIN movies m
    ON m.movieId = bm.movieId
ORDER BY
    ws.avg_rating DESC,
    ws.total_ratings DESC
LIMIT 50;
"""

df_top50_writers = con.sql(query_top50_writers).df()
df_top50_writers


Unnamed: 0,writer_name,total_ratings,total_movies,avg_rating,top_movie_id,top_movie_title,top_movie_ratings
0,Hayao Miyazaki,111224,12,4.14,5618,Spirited Away (Sen to Chihiro no kamikakushi) ...,35375
1,Jonathan Nolan,171702,4,4.13,58559,"Dark Knight, The (2008)",65349
2,Christopher Nolan,136039,5,4.11,79132,Inception (2010),65056
3,J.R.R. Tolkien,229152,5,4.1,4993,"Lord of the Rings: The Fellowship of the Ring,...",79940
4,Winston Groom,113581,1,4.07,356,Forrest Gump (1994),113581
5,Ethan Coen,131356,6,4.0,608,Fargo (1996),61977
6,Thomas Harris,127438,6,4.0,593,"Silence of the Lambs, The (1991)",101802
7,Stanley Kubrick,177405,10,3.96,1206,"Clockwork Orange, A (1971)",38195
8,Quentin Tarantino,396857,13,3.95,296,Pulp Fiction (1994),108756
9,Stephen King,332400,60,3.92,318,"Shawshank Redemption, The (1994)",122296


#### 6.2 Best ranked directors (with a minimum of 200,000 ratings)

In [324]:
query_top50_directors = """
WITH joined AS (
    SELECT
        d.director_name,
        r.movieId,
        r.rating
    FROM ratings r
    JOIN movielens_directors d
        ON d.movieId = r.movieId
    WHERE d.director_name IS NOT NULL
),
director_stats AS (
    SELECT
        director_name,
        COUNT(*) AS total_ratings,
        COUNT(DISTINCT movieId) AS total_movies,
        ROUND(AVG(rating), 2) AS avg_rating
    FROM joined
    GROUP BY director_name
    HAVING COUNT(*) >= 200000
),
director_movie_ratings AS (
    SELECT
        director_name,
        movieId,
        COUNT(*) AS movie_ratings
    FROM joined
    GROUP BY director_name, movieId
),
best_movie_per_director AS (
    SELECT
        director_name,
        movieId,
        movie_ratings,
        ROW_NUMBER() OVER (
            PARTITION BY director_name
            ORDER BY movie_ratings DESC, movieId
        ) AS rn
    FROM director_movie_ratings
)
SELECT
    ds.director_name,
    ds.total_ratings,
    ds.total_movies,
    ds.avg_rating,
    m.movieId        AS top_movie_id,
    m.title          AS top_movie_title,
    bm.movie_ratings AS top_movie_ratings
FROM director_stats ds
JOIN best_movie_per_director bm
    ON ds.director_name = bm.director_name
   AND bm.rn = 1
JOIN movies m
    ON m.movieId = bm.movieId
ORDER BY
    ds.avg_rating DESC,
    ds.total_ratings DESC
LIMIT 50;
"""

df_top50_directors = con.sql(query_top50_directors).df()
df_top50_directors


Unnamed: 0,director_name,total_ratings,total_movies,avg_rating,top_movie_id,top_movie_title,top_movie_ratings
0,Christopher Nolan,360868,11,4.08,58559,"Dark Knight, The (2008)",65349
1,Francis Ford Coppola,212138,22,4.05,858,"Godfather, The (1972)",75004
2,Quentin Tarantino,350473,12,4.01,296,Pulp Fiction (1994),108756
3,Stanley Kubrick,228429,16,3.97,1258,"Shining, The (1980)",40297
4,Peter Jackson,303438,14,3.95,4993,"Lord of the Rings: The Fellowship of the Ring,...",79940
5,David Fincher,290069,10,3.94,2959,Fight Club (1999),86207
6,Martin Scorsese,294606,41,3.93,1213,Goodfellas (1990),44592
7,Ridley Scott,297279,28,3.82,3578,Gladiator (2000),60749
8,Steven Spielberg,702204,35,3.81,527,Schindler's List (1993),84232
9,Robert Zemeckis,373858,22,3.76,356,Forrest Gump (1994),113581


#### 6.3 Best ranked producers (with a minimum of 200,000 ratings)

In [325]:
query_top50_producers_best_ranked = """
WITH joined AS (
    SELECT
        p.producer_name,
        r.movieId,
        r.rating
    FROM ratings r
    JOIN movielens_main_producers p
        ON p.movieId = r.movieId
    WHERE p.producer_name IS NOT NULL
),
producer_stats AS (
    SELECT
        producer_name,
        COUNT(*) AS total_ratings,
        COUNT(DISTINCT movieId) AS total_movies,
        ROUND(AVG(rating), 2) AS avg_rating
    FROM joined
    GROUP BY producer_name
    HAVING COUNT(*) >= 200000
),
producer_movie_ratings AS (
    SELECT
        producer_name,
        movieId,
        COUNT(*) AS movie_ratings
    FROM joined
    GROUP BY producer_name, movieId
),
best_movie_per_producer AS (
    SELECT
        producer_name,
        movieId,
        movie_ratings,
        ROW_NUMBER() OVER (
            PARTITION BY producer_name
            ORDER BY movie_ratings DESC, movieId
        ) AS rn
    FROM producer_movie_ratings
)
SELECT
    ps.producer_name,
    ps.total_ratings,
    ps.total_movies,
    ps.avg_rating,
    m.movieId AS top_movie_id,
    m.title   AS top_movie_title,
    bm.movie_ratings AS top_movie_ratings
FROM producer_stats ps
JOIN best_movie_per_producer bm
    ON ps.producer_name = bm.producer_name
   AND bm.rn = 1
JOIN movies m
    ON m.movieId = bm.movieId
ORDER BY
    ps.avg_rating DESC,
    ps.total_ratings DESC
LIMIT 50;
"""

df_top50_producers_best_ranked = con.sql(query_top50_producers_best_ranked).df()
df_top50_producers_best_ranked


Unnamed: 0,producer_name,total_ratings,total_movies,avg_rating,top_movie_id,top_movie_title,top_movie_ratings
0,Christopher Nolan,259578,10,4.09,58559,"Dark Knight, The (2008)",65349
1,Emma Thomas,302875,10,4.06,58559,"Dark Knight, The (2008)",65349
2,Barrie M. Osborne,254857,9,4.02,4993,"Lord of the Rings: The Fellowship of the Ring,...",79940
3,Lawrence Bender,373487,27,3.99,296,Pulp Fiction (1994),108756
4,Fran Walsh,280892,9,3.98,4993,"Lord of the Rings: The Fellowship of the Ring,...",79940
5,Stanley Kubrick,209286,10,3.97,1258,"Shining, The (1980)",40297
6,Ceán Chaffin,206750,8,3.96,2959,Fight Club (1999),86207
7,Peter Jackson,322301,18,3.94,4993,"Lord of the Rings: The Fellowship of the Ring,...",79940
8,Ethan Coen,255884,19,3.91,608,Fargo (1996),61977
9,Steven Spielberg,347555,27,3.88,527,Schindler's List (1993),84232
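
The writer, director and producer queries in this chapter share one shape and differ only in the credits table, the name column and the rating threshold. A minimal sketch of a helper that generates that shape is shown below. The function name `build_top_people_query` and its parameters are hypothetical and not part of the notebook. Table and column names are interpolated directly into the SQL text, so the helper should only receive trusted, hard-coded identifiers.

```python
def build_top_people_query(table, name_col, min_ratings, limit=50):
    """Return the 'best ranked people' SQL used in this chapter as a string.

    table       -- credits table, e.g. "movielens_directors"
    name_col    -- name column in that table, e.g. "director_name"
    min_ratings -- minimum number of ratings a person needs to be listed
    limit       -- maximum number of rows returned
    """
    return f"""
WITH joined AS (
    SELECT p.{name_col}, r.movieId, r.rating
    FROM ratings r
    JOIN {table} p ON p.movieId = r.movieId
    WHERE p.{name_col} IS NOT NULL
),
person_stats AS (
    SELECT {name_col},
           COUNT(*) AS total_ratings,
           COUNT(DISTINCT movieId) AS total_movies,
           ROUND(AVG(rating), 2) AS avg_rating
    FROM joined
    GROUP BY {name_col}
    HAVING COUNT(*) >= {int(min_ratings)}
),
person_movie_ratings AS (
    SELECT {name_col}, movieId, COUNT(*) AS movie_ratings
    FROM joined
    GROUP BY {name_col}, movieId
),
best_movie_per_person AS (
    SELECT {name_col}, movieId, movie_ratings,
           ROW_NUMBER() OVER (
               PARTITION BY {name_col}
               ORDER BY movie_ratings DESC, movieId
           ) AS rn
    FROM person_movie_ratings
)
SELECT ps.{name_col}, ps.total_ratings, ps.total_movies, ps.avg_rating,
       m.movieId AS top_movie_id,
       m.title AS top_movie_title,
       bm.movie_ratings AS top_movie_ratings
FROM person_stats ps
JOIN best_movie_per_person bm
    ON ps.{name_col} = bm.{name_col} AND bm.rn = 1
JOIN movies m ON m.movieId = bm.movieId
ORDER BY ps.avg_rating DESC, ps.total_ratings DESC
LIMIT {int(limit)};
"""

directors_sql = build_top_people_query("movielens_directors", "director_name", 200000)
print(directors_sql)
```

The generated string could then be passed to `con.sql(...).df()` in the same way as the hand-written queries.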


#### 6.4 Top 50 best ranked actors (with a minimum of 250,000 ratings)

In [326]:
query_top50_actors = """
WITH joined AS (
    SELECT
        a.actor_name,
        r.movieId,
        r.rating,
        a.characters
    FROM ratings r
    JOIN movielens_actors a
        ON a.movieId = r.movieId
    WHERE a.actor_name IS NOT NULL
),
actor_stats AS (
    SELECT
        actor_name,
        COUNT(*) AS total_ratings,
        COUNT(DISTINCT movieId) AS total_movies,
        ROUND(AVG(rating), 2) AS avg_rating
    FROM joined
    GROUP BY actor_name
    HAVING COUNT(*) >= 250000
),
actor_movie_ratings AS (
    SELECT
        actor_name,
        movieId,
        COUNT(*) AS movie_ratings,
        MIN(characters) AS character_name   -- if an actor has several rows for a movie, pick one
    FROM joined
    GROUP BY actor_name, movieId
),
best_movie_per_actor AS (
    SELECT
        actor_name,
        movieId,
        movie_ratings,
        character_name,
        ROW_NUMBER() OVER (
            PARTITION BY actor_name
            ORDER BY movie_ratings DESC, movieId
        ) AS rn
    FROM actor_movie_ratings
)
SELECT
    s.actor_name,
    s.total_ratings,
    s.total_movies,
    s.avg_rating,
    m.movieId AS top_movie_id,
    m.title   AS top_movie_title,
    b.character_name AS top_movie_character,
    b.movie_ratings  AS top_movie_ratings
FROM actor_stats s
JOIN best_movie_per_actor b
    ON s.actor_name = b.actor_name
   AND b.rn = 1
JOIN movies m
    ON m.movieId = b.movieId
ORDER BY
    s.avg_rating DESC,
    s.total_ratings DESC
LIMIT 50;
"""

df_top50_actors = con.sql(query_top50_actors).df()
df_top50_actors


Unnamed: 0,actor_name,total_ratings,total_movies,avg_rating,top_movie_id,top_movie_title,top_movie_character,top_movie_ratings
0,Sala Baker,313687,4,4.09,4993,"Lord of the Rings: The Fellowship of the Ring,...",Goblin,239820
1,Peter Mayhew,254428,5,4.06,260,Star Wars: Episode IV - A New Hope (1977),Chewbacca,97202
2,Terry Jones,296363,13,4.04,1136,Monty Python and the Holy Grail (1975),Dennis's Mother,143535
3,Terry Gilliam,295367,12,4.04,1136,Monty Python and the Holy Grail (1975),Green Knight,143535
4,Graham Chapman,295311,11,4.04,1136,Monty Python and the Holy Grail (1975),King Arthur,143535
5,Michael Palin,354277,25,4.0,1136,Monty Python and the Holy Grail (1975),Dennis,143535
6,Elijah Wood,328572,43,3.96,4993,"Lord of the Rings: The Fellowship of the Ring,...",Frodo,79940
7,Carrie Fisher,322749,30,3.96,260,Star Wars: Episode IV - A New Hope (1977),Princess Leia Organa,97202
8,Mark Hamill,300064,47,3.96,260,Star Wars: Episode IV - A New Hope (1977),Luke Skywalker,97202
9,Sean Astin,293481,65,3.93,4993,"Lord of the Rings: The Fellowship of the Ring,...",Sam,79940
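
Some counts in the table above look inflated. For movieId 4993, Sala Baker shows 239820 ratings while Elijah Wood and Sean Astin show 79940, and 239820 is exactly 3 × 79940. One plausible explanation, stated here as an assumption that was not checked against the data, is that `movielens_actors` holds several rows for the same actor and movie (for example one per character), so the join repeats every rating once per row. The sketch below reproduces the effect on toy data and shows a deduplicated join. It uses Python's built-in `sqlite3` as a stand-in for DuckDB, and the actor names and rows are invented for illustration.

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")
demo_con.executescript("""
CREATE TABLE ratings (userId INTEGER, movieId INTEGER, rating REAL);
CREATE TABLE movielens_actors (movieId INTEGER, actor_name TEXT, characters TEXT);

-- Two ratings for one movie.
INSERT INTO ratings VALUES (1, 4993, 4.0), (2, 4993, 5.0);

-- 'Actor A' is credited three times for the same movie, 'Actor B' once.
INSERT INTO movielens_actors VALUES
    (4993, 'Actor A', 'Role 1'),
    (4993, 'Actor A', 'Role 2'),
    (4993, 'Actor A', 'Role 3'),
    (4993, 'Actor B', 'Role 4');
""")

# Naive join: every rating is repeated once per credit row.
naive = demo_con.execute("""
    SELECT a.actor_name, COUNT(*) AS total_ratings
    FROM ratings r
    JOIN movielens_actors a ON a.movieId = r.movieId
    GROUP BY a.actor_name
    ORDER BY a.actor_name
""").fetchall()

# Deduplicated join: one row per (actor, movie) before joining the ratings.
dedup = demo_con.execute("""
    SELECT a.actor_name, COUNT(*) AS total_ratings
    FROM ratings r
    JOIN (SELECT DISTINCT movieId, actor_name FROM movielens_actors) a
        ON a.movieId = r.movieId
    GROUP BY a.actor_name
    ORDER BY a.actor_name
""").fetchall()

print(naive)  # [('Actor A', 6), ('Actor B', 2)]
print(dedup)  # [('Actor A', 2), ('Actor B', 2)]
demo_con.close()
```

If the assumption holds, applying the same `SELECT DISTINCT` step inside the `joined` CTE would change `total_ratings`, `avg_rating` and the 250,000 threshold check for actors with repeated credits, so the ranking above should be read with that caveat.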


### 7 Close connection to DuckDB

In [327]:
con.close()
print("Connection closed.")

Connection closed.
