# Enriching Movie Features with the TMDB API: A Random Sample (N = 20) from `movie_feb2025.csv`

This **work-in-progress** project enriches a sample movie dataset by leveraging the TMDB API to extract additional features at both the movie and person (cast/crew) levels.

## Step 1: Load and Inspect the Dataset

This step focuses on identifying missing values and understanding the basic structure of the full dataset.

In [1]:
import pandas as pd
df = pd.read_csv('movie_feb2025.csv', on_bad_lines='skip', engine='python')
df['popularity'] = pd.to_numeric(df['popularity'], errors='coerce')

print(f"Successfully loaded. Total number of rows: {len(df)}")
print(f"Actual number of columns: {df.shape[1]}")
print(df.head(5))

Successfully loaded. Total number of rows: 1032235
Actual number of columns: 27
   adult                     backdrop_path  \
0  False  /hQ4pYsIbP22TMXOUdSfC2mjWrO0.jpg   
1  False  /l94l89eMmFKh7na2a1u5q67VgNx.jpg   
2  False  /f2t4JbUvQIjUF5FstG1zZFAp02N.jpg   
3  False  /iUUpKunmBN5l8goObADBaFHnxQ8.jpg   
4  False                               NaN   

                               belongs_to_collection    budget  \
0  {'id': 1382526, 'name': "Kaurismäki's Proletar...         0   
1  {'id': 1382526, 'name': "Kaurismäki's Proletar...         0   
2                                                NaN   4000000   
3                                                NaN  21000000   
4                                                NaN     42000   

                                              genres  \
0  [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...   
1  [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...   
2                     [{'id': 35, 'name': 'Comedy'}]   
3  [{'id': 28, 'name':

In [2]:
df.columns

Index(['adult', 'backdrop_path', 'belongs_to_collection', 'budget', 'genres',
       'homepage', 'id', 'imdb_id', 'origin_country', 'original_language',
       'original_title', 'overview', 'popularity', 'poster_path',
       'production_companies', 'production_countries', 'release_date',
       'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title',
       'video', 'vote_average', 'vote_count', 'novelty'],
      dtype='object')

In [3]:
df.dtypes

adult                     object
backdrop_path             object
belongs_to_collection     object
budget                    object
genres                    object
homepage                  object
id                       float64
imdb_id                   object
origin_country            object
original_language         object
original_title            object
overview                  object
popularity               float64
poster_path               object
production_companies      object
production_countries      object
release_date              object
revenue                  float64
runtime                  float64
spoken_languages          object
status                    object
tagline                   object
title                     object
video                     object
vote_average             float64
vote_count               float64
novelty                  float64
dtype: object

In [4]:
df.isna().sum()

adult                          0
backdrop_path             705978
belongs_to_collection    1002926
budget                      5956
genres                      5956
homepage                  905222
id                          5956
imdb_id                   432462
origin_country              5956
original_language           5956
original_title              7062
overview                  178927
popularity                  8466
poster_path               263570
production_companies        7211
production_countries        8466
release_date              112419
revenue                     8466
runtime                     8466
spoken_languages            8466
status                      8466
tagline                   879456
title                       8475
video                       8466
vote_average                8466
vote_count                  8466
novelty                  1032235
dtype: int64

In [5]:
missing_info = pd.DataFrame({
    'missing_count': df.isna().sum(),
    'missing_percent': (df.isna().sum() / len(df) * 100).round(2)
}).sort_values(by='missing_percent', ascending=False)

print(missing_info)

                       missing_count  missing_percent
novelty                      1032235           100.00
belongs_to_collection        1002926            97.16
homepage                      905222            87.70
tagline                       879456            85.20
backdrop_path                 705978            68.39
imdb_id                       432462            41.90
poster_path                   263570            25.53
overview                      178927            17.33
release_date                  112419            10.89
revenue                         8466             0.82
title                           8475             0.82
vote_count                      8466             0.82
popularity                      8466             0.82
vote_average                    8466             0.82
production_countries            8466             0.82
video                           8466             0.82
runtime                         8466             0.82
spoken_languages            

In [6]:
df[['revenue', 'runtime', 'vote_average', 'vote_count']].describe()

| | revenue | runtime | vote_average | vote_count |
|---|---|---|---|---|
| count | 1023769.0 | 1023769.0 | 1023769.0 | 1023769.0 |
| mean | 760493.8 | 45.57168 | 2.175464 | 23.60902 |
| std | 17476980.0 | 57.47587 | 3.102931 | 364.8239 |
| min | -12.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 2.0 | 0.0 | 0.0 |
| 50% | 0.0 | 24.0 | 0.0 | 0.0 |
| 75% | 0.0 | 86.0 | 5.038 | 1.0 |
| max | 2923706000.0 | 14400.0 | 10.0 | 36753.0 |


In [7]:
df[['origin_country']].value_counts()

origin_country          
['US']                      414581
['FR']                       56550
['JP']                       44657
['DE']                       42226
['ES']                       34224
                             ...  
['DE', 'KR', 'RU']               1
['DE', 'KR', 'TR']               1
['DE', 'KZ', 'TR', 'GB']         1
['DE', 'LB', 'EG']               1
['FI', 'GB', 'US']               1
Name: count, Length: 5639, dtype: int64

In [8]:
df[['spoken_languages']].value_counts()

spoken_languages
[]                                                                     366448
[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]

### Summary

- The full dataset contains **1,032,235 movies** and **27 columns**.
- Only **0.58%** of entries (5,956 rows) are missing the TMDB `id`, `origin_country`, and `original_language` fields. Because these rows lack the identifier needed for API lookups, they can be safely dropped.
- Just **0.82%** of movies (8,466 rows) lack audience vote data (`vote_average`, `vote_count`), which makes them suitable candidates for prediction once a model is trained.
- **Image features** such as `poster_path` and `backdrop_path` aren't direct URLs, but can be viewed by prepending `https://image.tmdb.org/t/p/original` to the stored path (which already begins with `/`), according to [TMDB's image guide](https://developer.themoviedb.org/docs/image-basics).  

  Example:  
  `https://image.tmdb.org/t/p/original/hQ4pYsIbP22TMXOUdSfC2mjWrO0.jpg`
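
As a minimal sketch of the URL rule above, the helper below joins the base URL and a stored path. The names `TMDB_IMAGE_BASE` and `build_image_url` are illustrative and not part of the dataset or the TMDB API; missing paths (NaN after `pd.read_csv`) are assumed to map to `None`.

```python
from typing import Optional

# Base URL from the image guide linked above; "original" selects the full-size image.
TMDB_IMAGE_BASE = "https://image.tmdb.org/t/p/original"


def build_image_url(path: Optional[str]) -> Optional[str]:
    """Return a full image URL for a poster/backdrop path, or None if it is missing."""
    # Missing values arrive as NaN (a float) or None, so anything that is not a
    # non-empty string is treated as "no image".
    if not isinstance(path, str) or not path:
        return None
    # Stored paths already start with "/", so strip it to avoid a double slash.
    return f"{TMDB_IMAGE_BASE}/{path.lstrip('/')}"


poster_url = build_image_url("/hQ4pYsIbP22TMXOUdSfC2mjWrO0.jpg")
```

Applied to the dataset, this could be used as `df["poster_path"].apply(build_image_url)`.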

## Step 2: Prepare the Test Sample

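In this step, I draw a 20-row random sample from the full movie dataset (`movie_feb2025.csv`) and save it as `sample.csv`. Because `df.sample` is called without a `random_state`, each run selects different rows.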

In [9]:
sample = df.sample(n=20)
sample.head()

| | adult | backdrop_path | belongs_to_collection | budget | genres | homepage | id | imdb_id | origin_country | original_language | ... | revenue | runtime | spoken_languages | status | tagline | title | video | vote_average | vote_count | novelty |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 713954 | False | /4CqtezRehn0x1RVk7XgCTYk8gBP.jpg | | 2900000 | [{'id': 18, 'name': 'Drama'}, {'id': 53, 'name... | | 1001112.0 | tt27817382 | ['FR'] | fr | ... | 0.0 | 92.0 | [{'english_name': 'French', 'iso_639_1': 'fr',... | Released | | Cool Headed | False | 6.1 | 16.0 | |
| 992090 | False | /bDLLvJfSOMgQj9SsxJhxsqNwRhP.jpg | | 0 | [{'id': 35, 'name': 'Comedy'}, {'id': 10749, '... | https://uptv.com/movies/a-bluegrass-christmas/ | 1381839.0 | tt30816463 | ['US'] | en | ... | 0.0 | 84.0 | [] | Released | | A Bluegrass Christmas | False | 6.333 | 3.0 | |
| 197851 | False | /uifDpm3WgD8M5iyxFl4uqohWnUV.jpg | | 0 | [{'id': 10402, 'name': 'Music'}, {'id': 99, 'n... | | 314781.0 | tt0460475 | ['US'] | en | ... | 0.0 | 123.0 | [{'english_name': 'English', 'iso_639_1': 'en'... | Released | | Jennifer Lopez: The Reel Me | True | 8.737 | 19.0 | |
| 432336 | False | /cFl4LeZZbB7vqhh6f99rqSAnU7q.jpg | | 500 | [{'id': 10749, 'name': 'Romance'}, {'id': 18, ... | http://www.inthenow.com | 610613.0 | tt9662726 | ['US'] | en | ... | 0.0 | 28.0 | [{'english_name': 'English', 'iso_639_1': 'en'... | Released | If you're here, it's not over | In The Now | False | 0.0 | 0.0 | |
| 409212 | False | | | 0 | [{'id': 18, 'name': 'Drama'}, {'id': 10770, 'n... | | 580195.0 | tt0070408 | ['GB'] | en | ... | 0.0 | 20.0 | [{'english_name': 'English', 'iso_639_1': 'en'... | Released | | Monologue | False | 0.0 | 0.0 | |


In [10]:
sample.to_csv("sample.csv", index=False)
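
The sample drawn above changes on every run and may include rows without a TMDB `id`, which cannot be looked up through the API. A possible reproducible variant is sketched below; the function name `draw_sample`, the seed value `42`, and the toy DataFrame are illustrative assumptions rather than part of the original workflow.

```python
import pandas as pd


def draw_sample(movies: pd.DataFrame, n: int = 20, seed: int = 42) -> pd.DataFrame:
    """Draw a reproducible random sample of rows that have a TMDB id."""
    # Rows without an id cannot be queried against the API, so drop them first.
    with_id = movies.dropna(subset=["id"])
    # A fixed random_state returns the same rows on every run.
    return with_id.sample(n=min(n, len(with_id)), random_state=seed)


# Tiny stand-in for the full dataset (values are illustrative only).
toy = pd.DataFrame({"id": [1.0, None, 3.0, 4.0], "title": ["a", "b", "c", "d"]})
first = draw_sample(toy, n=2)
second = draw_sample(toy, n=2)
```

With the real data, `draw_sample(df).to_csv("sample.csv", index=False)` would replace the two lines of the cell above.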

## Step 3: Enrich Movie-based Features

For each movie, the following TMDB API endpoints were used to extract additional metadata and enrich the dataset:

- **Credits**: cast and crew details (e.g., roles, departments, popularity) 
- **External IDs**: links to IMDb, Wikidata, and social media (Facebook, Instagram, Twitter)  
- **Keywords**: thematic tags and topics  
- **Lists**: user-created watchlists the movie appears in  
- **Recommendations**: similar movies suggested by TMDB  
- **Reviews**: short user reviews with author info  
- **Translations**: available language versions  
- **Videos**: trailers, clips, and teasers  
- **Watch Providers**: platforms where the movie can be streamed or bought
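
Every endpoint in this step is queried with the same request pattern, so the per-endpoint loops could share one helper. The sketch below is one possible shape, not the code used in this notebook: `fetch_json`, its parameters, the 10-second timeout, and the retry policy are all assumptions. The HTTP function is passed in as `get` so the helper can be exercised without network access.

```python
import time
from typing import Any, Callable, Optional


def fetch_json(get: Callable[..., Any], url: str, token: str,
               retries: int = 3, pause: float = 0.25) -> Optional[dict]:
    """GET a TMDB URL and return the parsed JSON, or None after repeated failures.

    `get` is any callable with the same call shape as `requests.get`.
    """
    headers = {"accept": "application/json", "Authorization": f"Bearer {token}"}
    for attempt in range(retries):
        try:
            response = get(url, headers=headers, timeout=10)
        except Exception:
            response = None  # network error: fall through and retry
        if response is not None and response.status_code == 200:
            return response.json()
        if response is not None and response.status_code == 404:
            return None  # unknown id: retrying will not help
        # Wait a little longer before each retry (covers 429 and transient errors).
        time.sleep(pause * (attempt + 1))
    return None
```

A call might look like `fetch_json(requests.get, f"https://api.themoviedb.org/3/movie/{int(movie_id)}/keywords", bearer_token)`, after `import requests`.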

### Credits

#### Appended Variables

Based on the [TMDB API documentation for movie credits](https://developer.themoviedb.org/reference/movie-credits), the following variables were extracted and appended to the dataset:

- `cast_count`: number of cast members  
- `crew_count`: number of crew members  
- `cast`: JSON-encoded list of cast records  
- `crew`: JSON-encoded list of crew records  

Each object inside `cast` and `crew` contains structured attributes such as:

- `adult`, `gender`, `id`, `known_for_department`, `name`, `original_name`, `popularity`, `profile_path`, `credit_id`  
- (cast only): `cast_id`, `character`, `order`  
- (crew only): `department`, `job`

These nested JSON records provide rich metadata for each person involved in the film.  
The `id` field from each person entry (renamed as `person_id`) will be used in **Step 4: Extract Credit-based Features (People-level)** to retrieve additional personal-level details from TMDB's `/person` endpoints.
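
To illustrate how these nested records can be read, the sketch below pulls the top-billed cast and the directors out of one movie's `cast` and `crew` JSON strings. The function name `summarize_credits` and the sample people are hypothetical; only the field names (`name`, `order`, `job`) come from the attribute list above.

```python
import json


def summarize_credits(cast_json: str, crew_json: str, top_n: int = 3) -> dict:
    """Return the top-billed cast names and the director names for one movie."""
    cast = json.loads(cast_json)
    crew = json.loads(crew_json)
    # `order` is the billing position; lower values are more prominent.
    # Entries without an order are sorted last.
    by_billing = sorted(cast, key=lambda person: person.get("order", len(cast)))
    top_cast = [person["name"] for person in by_billing[:top_n]]
    directors = [person["name"] for person in crew if person.get("job") == "Director"]
    return {"top_cast": top_cast, "directors": directors}


# Hypothetical records shaped like the TMDB credits response.
cast_json = json.dumps([
    {"name": "Actor B", "order": 1},
    {"name": "Actor A", "order": 0},
])
crew_json = json.dumps([
    {"name": "Person C", "job": "Director", "department": "Directing"},
    {"name": "Person D", "job": "Editor", "department": "Editing"},
])
summary = summarize_credits(cast_json, crew_json)
```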

In [11]:
import requests
import time
import json

# Load the sample dataset containing movie IDs
df = pd.read_csv("sample.csv")

# API credentials: read the bearer token from an environment variable rather than
# hard-coding it. TMDB_BEARER_TOKEN is an assumed variable name; export it before running.
import os
bearer_token = os.environ["TMDB_BEARER_TOKEN"]

# Prepare empty lists for new columns
cast_count_list = []
crew_count_list = []
cast_raw_list = []
crew_raw_list = []
error_list = []

# Loop through each movie ID in the dataset and retrieve credits data
for movie_id in df["id"]:
    url = f"https://api.themoviedb.org/3/movie/{int(movie_id)}/credits?language=en-US"
    headers = {
        "accept": "application/json",
        "Authorization": f"Bearer {bearer_token}"
    }

    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            data = response.json()
            cast = data.get("cast", [])
            crew = data.get("crew", [])

            cast_count_list.append(len(cast))
            crew_count_list.append(len(crew))
            cast_raw_list.append(json.dumps(cast))  # Store raw JSON string
            crew_raw_list.append(json.dumps(crew))
            error_list.append(None)
        else:
            # Handle API error
            cast_count_list.append(None)
            crew_count_list.append(None)
            cast_raw_list.append(None)
            crew_raw_list.append(None)
            error_list.append(response.status_code)
    except Exception as e:
        # Handle other unexpected errors
        cast_count_list.append(None)
        crew_count_list.append(None)
        cast_raw_list.append(None)
        crew_raw_list.append(None)
        error_list.append(str(e))

    time.sleep(0.25)  # Respect API rate limits

# Add new columns to the original DataFrame
df["cast_count"] = cast_count_list
df["crew_count"] = crew_count_list
df["cast"] = cast_raw_list
df["crew"] = crew_raw_list
df["api_error"] = error_list

# Save the updated DataFrame back to sample.csv (overwrite)
df.to_csv("sample.csv", index=False)

### External IDs

#### Appended Variable
Based on the [TMDB API documentation for movie external IDs](https://developer.themoviedb.org/reference/movie-external-ids), the `external_ids` field was appended to the dataset. 

It includes external reference identifiers such as IMDb ID, Wikidata ID, and social media handles (Facebook, Instagram, Twitter), which may support further cross-platform enrichment or validation.

In [12]:
# List to hold external IDs for each movie
external_ids_data = []

# Loop through each movie ID
for movie_id in df["id"]:
    url = f"https://api.themoviedb.org/3/movie/{int(movie_id)}/external_ids"
    headers = {
        "accept": "application/json",
        "Authorization": f"Bearer {bearer_token}"
    }

    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            data = response.json()
            external_ids_data.append(data)  # Store full dict (includes IMDb, Facebook, Twitter, etc.)
        else:
            external_ids_data.append(None)
    except Exception:
        external_ids_data.append(None)

    time.sleep(0.25)  # Avoid hitting rate limit

# Add as a new column to the dataframe
df["external_ids"] = external_ids_data

# Save updated file
df.to_csv("sample.csv", index=False)

### Keywords

#### Appended Variable
Based on the [TMDB API documentation for movie keywords](https://developer.themoviedb.org/reference/movie-keywords), the `keywords` field was added.  

It contains thematic tags such as `'space travel'`, `'based on manga'`, etc., useful for content-based filtering or topic analysis.

In [13]:
# Initialize a list to store keywords for each movie
keywords_data = []

# Loop through each movie ID
for movie_id in df["id"]:
    url = f"https://api.themoviedb.org/3/movie/{int(movie_id)}/keywords"
    headers = {
        "accept": "application/json",
        "Authorization": f"Bearer {bearer_token}"
    }

    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            data = response.json()
            keywords = [k["name"] for k in data.get("keywords", [])]
            keywords_data.append(keywords)
        else:
            keywords_data.append(None)
    except Exception as e:
        keywords_data.append(None)

    time.sleep(0.25)  # Avoid rate limiting

# Add keywords as a new column
df["keywords"] = keywords_data

# Overwrite sample.csv with the new column
df.to_csv("sample.csv", index=False)
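
Because `keywords` holds Python lists, `to_csv` writes each cell in its string form, and `pd.read_csv` later returns plain strings rather than lists. One possible way to restore them is sketched below using `ast.literal_eval`; the helper name `parse_list_cell` and the toy data are illustrative assumptions.

```python
import ast
import io

import pandas as pd


def parse_list_cell(cell):
    """Turn a stringified Python list from a CSV cell back into a list."""
    # Failed API calls were stored as None, which reads back as NaN (a float).
    if not isinstance(cell, str):
        return None
    return ast.literal_eval(cell)


# Round-trip a toy frame through CSV, as the cell above does with sample.csv.
original = pd.DataFrame({
    "id": [1, 2],
    "keywords": [["space travel", "based on manga"], None],
})
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer)
restored["keywords"] = restored["keywords"].apply(parse_list_cell)
```

The same approach should apply to the other list- or dict-valued columns added in this step, since they are written to `sample.csv` in the same way.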

### Lists

#### Appended Variable
Based on the [TMDB API documentation for movie lists](https://developer.themoviedb.org/reference/movie-lists), the `list_names` field was appended.  

It captures the names of public user-created lists that the movie appears in (first page of results only), offering insight into audience clustering.

In [14]:
# Initialize list to store list names
list_names_data = []

# Loop through movie IDs
for movie_id in df["id"]:
    url = f"https://api.themoviedb.org/3/movie/{int(movie_id)}/lists?language=en-US&page=1"
    headers = {
        "accept": "application/json",
        "Authorization": f"Bearer {bearer_token}"
    }

    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            data = response.json()
            names = [item["name"] for item in data.get("results", [])]
            list_names_data.append(names)
        else:
            list_names_data.append(None)
    except Exception as e:
        list_names_data.append(None)

    time.sleep(0.25)  # Prevent hitting the rate limit

# Add list names as a new column
df["list_names"] = list_names_data

# Save the result to CSV
df.to_csv("sample.csv", index=False)

### Recommendations

#### Appended Variable
Based on the [TMDB API documentation for movie recommendations](https://developer.themoviedb.org/reference/movie-recommendations), the `recommendations` field was added.  

It contains the titles of similar or related movies suggested by TMDB (first page of results only), useful for modeling content similarity or audience interest overlap.

In [15]:
# Initialize a list to store recommended movie titles
recommendations_data = []

# Loop through each movie ID
for movie_id in df["id"]:
    # Construct the API endpoint for recommendations
    url = f"https://api.themoviedb.org/3/movie/{int(movie_id)}/recommendations?language=en-US&page=1"
    
    # Set up request headers with authorization
    headers = {
        "accept": "application/json",
        "Authorization": f"Bearer {bearer_token}"
    }

    try:
        # Make GET request to TMDB API
        response = requests.get(url, headers=headers)
        
        if response.status_code == 200:
            # Parse response JSON
            data = response.json()
            
            # Extract the titles of recommended movies (if any)
            titles = [item["title"] for item in data.get("results", [])]
            recommendations_data.append(titles)
        else:
            # Append None if API call fails
            recommendations_data.append(None)
    except Exception as e:
        # Append None if request raises an exception
        recommendations_data.append(None)

    # Add a short delay to avoid API rate limits
    time.sleep(0.25)

# Add the recommendations as a new column
df["recommendations"] = recommendations_data

# Save the updated DataFrame to CSV
df.to_csv("sample.csv", index=False)

### Reviews

#### Appended Variable
Based on the [TMDB API documentation for movie reviews](https://developer.themoviedb.org/reference/movie-reviews), the `reviews_TMDB` field was added.

It stores each review from the first results page as an `author: content` string, with the content truncated to its first 100 characters. This may be useful for sentiment analysis or understanding audience reception.

In [16]:
# Initialize a list to store review summaries
review_data = []

# Loop through each movie ID
for movie_id in df["id"]:
    # Construct the API URL for movie reviews
    url = f"https://api.themoviedb.org/3/movie/{int(movie_id)}/reviews?language=en-US&page=1"

    # Set up authorization headers
    headers = {
        "accept": "application/json",
        "Authorization": f"Bearer {bearer_token}"
    }

    try:
        # Make the GET request
        response = requests.get(url, headers=headers)

        if response.status_code == 200:
            data = response.json()
            # Extract the review contents or author-review pairs
            reviews = [f"{item['author']}: {item['content'][:100]}..." for item in data.get("results", [])]
            review_data.append(reviews)
        else:
            review_data.append(None)
    except Exception as e:
        review_data.append(None)

    # Sleep to avoid hitting the API rate limit
    time.sleep(0.25)

# Add a new column for reviews
df["reviews_TMDB"] = review_data

# Save the updated CSV file
df.to_csv("sample.csv", index=False)

### Translations

#### Appended Variable
Based on the [TMDB API documentation for movie translations](https://developer.themoviedb.org/reference/movie-translations), the `translations` field was added.  

It lists the ISO 639-1 codes of all languages the movie has been translated into, which helps assess its international availability and cultural reach.

In [17]:
# Create list to store translation language codes
translation_data = []

# Loop through each movie ID
for movie_id in df["id"]:
    url = f"https://api.themoviedb.org/3/movie/{int(movie_id)}/translations"
    headers = {
        "accept": "application/json",
        "Authorization": f"Bearer {bearer_token}"
    }

    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            data = response.json()
            translations = [item["iso_639_1"] for item in data.get("translations", [])]
            translation_data.append(translations)
        else:
            translation_data.append(None)
    except Exception:
        translation_data.append(None)

    time.sleep(0.25)  # to avoid API rate limit

# Add the translations column to the original dataframe
df["translations"] = translation_data

# Save the updated DataFrame back to CSV
df.to_csv("sample.csv", index=False)

### Videos

#### Appended Variable
Based on the [TMDB API documentation for movie videos](https://developer.themoviedb.org/reference/movie-videos), the `videos` field was appended.  

It lists each video as a `type: name` string, covering trailers, teasers, clips, and other official promotional media, often hosted on YouTube, and useful for content preview and analysis.

In [18]:
video_data = []

for movie_id in df["id"]:
    url = f"https://api.themoviedb.org/3/movie/{int(movie_id)}/videos?language=en-US"
    headers = {
        "accept": "application/json",
        "Authorization": f"Bearer {bearer_token}"
    }

    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            data = response.json()
            videos = [f"{item['type']}: {item['name']}" for item in data.get("results", [])]
            video_data.append(videos)
        else:
            video_data.append(None)
    except Exception:
        video_data.append(None)

    time.sleep(0.25)  # Avoid rate limit

# Add new column to DataFrame
df["videos"] = video_data

# Save to CSV
df.to_csv("sample.csv", index=False)

### Watch Providers

#### Appended Variable
Based on the [TMDB API documentation for watch providers](https://developer.themoviedb.org/reference/movie-watch-providers), the `watch_providers` field was added.  

It contains information about where a movie can be streamed, rented, or purchased across different countries and platforms (e.g., Netflix, Apple TV, Amazon).

In [19]:
# Initialize a list to store watch provider info
watch_provider_data = []

# Loop through movie IDs
for movie_id in df["id"]:
    url = f"https://api.themoviedb.org/3/movie/{int(movie_id)}/watch/providers"
    headers = {
        "accept": "application/json",
        "Authorization": f"Bearer {bearer_token}"
    }

    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            data = response.json()
            # Store full country-wise results for now
            watch_provider_data.append(data.get("results", {}))
        else:
            watch_provider_data.append(None)
    except Exception as e:
        watch_provider_data.append(None)

    time.sleep(0.25)  # Prevent hitting the API rate limit

# Add the data to the dataframe
df["watch_providers"] = watch_provider_data

# Save to CSV
df.to_csv("sample.csv", index=False)
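
The stored `watch_providers` value is a nested, country-keyed dictionary. As a hedged sketch of how it might be flattened for one country, the helper below collects provider names per offer type. The function name `provider_names` and the sample data are hypothetical, and the keys `flatrate`, `rent`, `buy`, and `provider_name` reflect my understanding of the response shape, so check them against the documentation linked above.

```python
from typing import Optional


def provider_names(results: Optional[dict], country: str = "US") -> dict:
    """Map each offer type to the provider names available in one country."""
    # A failed API call was stored as None, so fall back to an empty dict.
    entry = (results or {}).get(country, {})
    offer_types = ("flatrate", "rent", "buy")
    return {
        kind: [provider["provider_name"] for provider in entry.get(kind, [])]
        for kind in offer_types
    }


# Illustrative data shaped like the "results" object of the response.
sample_results = {
    "US": {
        "flatrate": [{"provider_name": "Netflix"}],
        "buy": [{"provider_name": "Apple TV"}],
    }
}
us_offers = provider_names(sample_results)
```

After a CSV round trip the column holds the dictionary's string form, so it would need to be parsed (for example with `ast.literal_eval`) before calling the helper.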

## Step 4: Extract Credit-based Features (People-level)

This step focuses on extracting and enriching information about the people (cast and crew) involved in each movie.

First, the `cast` and `crew` fields in `sample.csv`, originally stored as nested JSON, were expanded. Each individual became a separate row linked to the corresponding `movie_id`, along with the following metadata:

- `person_id`, `role`, `adult`, `gender`, `known_for_department`, `name`, `original_name`, `popularity`, `profile_path`, `cast_id`, `character`, `credit_id`, `order`, `department`, `job`

However, this metadata is limited to role-specific context within a single movie. To enrich the dataset with more comprehensive and person-level insights, three types of additional information were retrieved via the TMDB API:

- **Details**: includes biography, birthday, place of birth, and other personal metadata  
- **Combined Credits**: a complete list of works in which the person participated, both as cast and crew 
- **External IDs**: links to external platforms such as IMDb, Wikidata, and social media (e.g., Facebook, Instagram, Twitter, YouTube, TikTok)

The new variables from these API responses were merged into the dataset, allowing for deeper analysis of individual careers, popularity, and influence across multiple productions.

In [20]:
import json

# Load the sample dataset
df = pd.read_csv("sample.csv")

# List to collect detailed cast and crew rows
detailed_rows = []

# Iterate through each row
for idx, row in df.iterrows():
    movie_id = row["id"]
    
    # Parse cast
    try:
        cast_list = json.loads(row["cast"])
        for person in cast_list:
            person_data = person.copy()
            person_data["movie_id"] = movie_id
            person_data["role"] = "cast"
            person_data["person_id"] = person["id"]  # Preserve for clarity
            person_data.pop("id", None)  # Remove original 'id' to avoid duplication
            detailed_rows.append(person_data)
    except Exception:
        pass

    # Parse crew
    try:
        crew_list = json.loads(row["crew"])
        for person in crew_list:
            person_data = person.copy()
            person_data["movie_id"] = movie_id
            person_data["role"] = "crew"
            person_data["person_id"] = person["id"]
            person_data.pop("id", None)
            detailed_rows.append(person_data)
    except Exception:
        pass

# Create DataFrame
people_df = pd.DataFrame(detailed_rows)

# Reorder columns
columns = ['movie_id', 'person_id', 'role'] + [col for col in people_df.columns if col not in ['movie_id', 'person_id', 'role']]
people_df = people_df[columns]

# Save to CSV
people_df.to_csv("sample_people.csv", index=False)

### Details  
#### Appended Variable  
Based on the [TMDB API for person details](https://developer.themoviedb.org/reference/person-details), additional metadata was appended for each `person_id`. The most important field is `biography`, which offers rich contextual background about the person.

In [21]:
# Load the original dataset containing person_id values
df = pd.read_csv("sample_people.csv")

# Bearer token for API access, read from the environment rather than hard-coded
# (TMDB_BEARER_TOKEN is an assumed variable name)
import os
bearer_token = os.environ["TMDB_BEARER_TOKEN"]

# List to store API response details for each person
detail_list = []

# Loop through each person_id and retrieve their detailed info from TMDB
for person_id in df["person_id"]:
    url = f"https://api.themoviedb.org/3/person/{int(person_id)}?language=en-US"
    headers = {
        "accept": "application/json",
        "Authorization": f"Bearer {bearer_token}"
    }

    try:
        # Send GET request
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            detail = response.json()  # Successful response
        else:
            detail = {}  # Handle non-200 status
    except Exception as e:
        detail = {}  # Handle exceptions (e.g., connection errors)

    # Add the result to our list
    detail_list.append(detail)
    
    # Pause briefly to avoid hitting API rate limits
    time.sleep(0.25)

# Convert the collected data into a DataFrame
detail_df = pd.DataFrame(detail_list)

# Concatenate the original DataFrame with the new detailed info
merged_df = pd.concat([df.reset_index(drop=True), detail_df.reset_index(drop=True)], axis=1)

# Save the final merged dataset to a CSV file
merged_df.to_csv("sample_people_detailed.csv", index=False)

#### Drop Duplicated Columns  
The `details` response repeats several columns already present in the people-level dataset: `"profile_path"`, `"popularity"`, `"name"`, `"known_for_department"`, `"gender"`, and `"adult"`, plus `"id"`, which duplicates `person_id`. To avoid redundancy after merging with the `details` API results, these duplicated columns were dropped.

In [22]:
# Load the dataset
df = pd.read_csv("sample_people_detailed.csv")

# Drop TMDB's own "id" column, which duplicates person_id (ignored if every request failed)
df = df.drop(columns=["id"], errors="ignore")
columns_to_drop = ["profile_path", "popularity", "name", "known_for_department", "gender", "adult"]
# pandas renames repeated column names with a ".1" suffix when the CSV is read back
duplicated_columns = [col for col in df.columns if col.endswith(".1") and col[:-2] in columns_to_drop]
people_df = df.drop(columns=duplicated_columns)

# Save the cleaned dataframe
people_df.to_csv("sample_people.csv", index=False)
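
An alternative to dropping the `.1` columns afterwards is to keep only the genuinely new detail columns before concatenating. The sketch below shows the idea on toy data; `append_new_columns` is a hypothetical helper, and it assumes the two frames are row-aligned, as they are in the loop above.

```python
import pandas as pd


def append_new_columns(people: pd.DataFrame, details: pd.DataFrame) -> pd.DataFrame:
    """Append only the detail columns that the people table does not already have."""
    # Skip columns that already exist, and TMDB's own `id` (same as person_id).
    new_cols = [c for c in details.columns if c not in people.columns and c != "id"]
    return pd.concat(
        [people.reset_index(drop=True), details[new_cols].reset_index(drop=True)],
        axis=1,
    )


# Toy frames standing in for sample_people.csv and the API responses.
people = pd.DataFrame({"person_id": [10, 11], "name": ["A", "B"]})
details = pd.DataFrame({
    "id": [10, 11],
    "name": ["A", "B"],
    "biography": ["bio a", "bio b"],
})
merged = append_new_columns(people, details)
```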

### Combined Credits

#### Appended Variable

To expand the dataset with richer people-level metadata, we used the [TMDB Combined Credits API](https://developer.themoviedb.org/reference/person-combined-credits) for each `person_id` extracted from cast and crew members.

The following Python code iterates through each `person_id`, retrieves their acting and crew contributions across both movies and TV shows, and appends four new columns to the dataset:

- `combined_cast_count`: number of titles where the person acted  
- `combined_crew_count`: number of titles where the person contributed as crew  
- `combined_cast`: full JSON list of cast credits  
- `combined_crew`: full JSON list of crew credits  

Each entry in the cast/crew list includes fields such as:
- `title`, `release_date`, `overview`, `media_type`, `vote_average`, `vote_count`, `character`, `job`, `department`

In [23]:
# Load dataset
df = pd.read_csv("sample_people.csv")

# Get unique person_ids to avoid duplication
unique_person_ids = df["person_id"].unique()

# Initialize list to store combined credits
combined_credits_data = []

# Iterate through unique person_id only
for person_id in unique_person_ids:
    url = f"https://api.themoviedb.org/3/person/{int(person_id)}/combined_credits?language=en-US"
    headers = {
        "accept": "application/json",
        "Authorization": f"Bearer {bearer_token}"
    }

    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            data = response.json()
            combined_cast = json.dumps(data.get("cast", []))
            combined_crew = json.dumps(data.get("crew", []))
            cast_count = len(data.get("cast", []))
            crew_count = len(data.get("crew", []))
        else:
            combined_cast = None
            combined_crew = None
            cast_count = None
            crew_count = None
    except Exception:
        combined_cast = None
        combined_crew = None
        cast_count = None
        crew_count = None

    combined_credits_data.append({
        "person_id": person_id,
        "combined_cast_count": cast_count,
        "combined_crew_count": crew_count,
        "combined_cast": combined_cast,
        "combined_crew": combined_crew
    })

    time.sleep(0.25)

# Convert to DataFrame
credits_df = pd.DataFrame(combined_credits_data)

# Merge back with original df
df = pd.merge(df, credits_df, on="person_id", how="left")

# Save updated DataFrame
df.to_csv("sample_people.csv", index=False)
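
The raw `combined_cast` JSON can be reduced to a few career-level features. The sketch below is one possible summary, not part of the pipeline above: `career_features` and the sample credits are hypothetical, and it uses only the `media_type`, `vote_average`, and `vote_count` fields listed earlier.

```python
import json
from statistics import mean


def career_features(combined_cast_json) -> dict:
    """Summarize one person's acting credits from the combined_cast JSON string."""
    # Failed API calls were stored as None, which reads back from CSV as NaN.
    credits = json.loads(combined_cast_json) if isinstance(combined_cast_json, str) else []
    # Ignore titles nobody has rated, since their vote_average of 0 is a placeholder.
    rated = [c["vote_average"] for c in credits if c.get("vote_count", 0) > 0]
    return {
        "movie_credits": sum(c.get("media_type") == "movie" for c in credits),
        "tv_credits": sum(c.get("media_type") == "tv" for c in credits),
        "mean_vote_average": mean(rated) if rated else None,
    }


# Hypothetical credits shaped like entries of the combined credits response.
sample_credits = json.dumps([
    {"title": "Film One", "media_type": "movie", "vote_average": 6.0, "vote_count": 10},
    {"title": "Film Two", "media_type": "movie", "vote_average": 8.0, "vote_count": 5},
    {"name": "Show One", "media_type": "tv", "vote_average": 0.0, "vote_count": 0},
])
features = career_features(sample_credits)
```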

### External IDs

#### Appended Variable
Based on the [TMDB API for Person External IDs](https://developer.themoviedb.org/reference/person-external-ids), we retrieved additional platform-specific identifiers for each `person_id`, including:

- `imdb_id`  
- `wikidata_id`  
- `facebook_id`  
- `instagram_id`  
- `tiktok_id`  
- `twitter_id`  
- `youtube_id`  

These values were appended to the dataset as **separate columns**, rather than a nested dictionary, for easier access and future analysis (e.g., social presence, cross-database linking).

In [24]:
# Load the full dataset (with repeated person_id values)
df = pd.read_csv("sample_people.csv")

# Only keep unique person_ids for API call
unique_ids = df["person_id"].drop_duplicates()

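# Bearer token: read from the environment, since a literal placeholder here would
# make every request fail with 401 (TMDB_BEARER_TOKEN is an assumed variable name)
import os
bearer_token = os.environ["TMDB_BEARER_TOKEN"]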

# Collect external ID info
external_ids_list = []

# Loop through each unique person_id
for person_id in unique_ids:
    url = f"https://api.themoviedb.org/3/person/{int(person_id)}/external_ids"
    headers = {
        "accept": "application/json",
        "Authorization": f"Bearer {bearer_token}"
    }

    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            data = response.json()
        else:
            data = {
                "imdb_id": None,
                "wikidata_id": None,
                "facebook_id": None,
                "instagram_id": None,
                "tiktok_id": None,
                "twitter_id": None,
                "youtube_id": None
            }
    except Exception:
        data = {
            "imdb_id": None,
            "wikidata_id": None,
            "facebook_id": None,
            "instagram_id": None,
            "tiktok_id": None,
            "twitter_id": None,
            "youtube_id": None
        }

    data["person_id"] = person_id
    external_ids_list.append(data)
    time.sleep(0.25)  # Avoid rate limit

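# Keep only the identifier columns listed above
id_columns = ["imdb_id", "wikidata_id", "facebook_id", "instagram_id",
              "tiktok_id", "twitter_id", "youtube_id"]
external_ids_df = pd.DataFrame(external_ids_list).reindex(columns=["person_id"] + id_columns)

# Drop identifiers already in df (e.g., imdb_id from person details) to avoid _x/_y duplicates
df = df.drop(columns=[c for c in id_columns if c in df.columns])

# Merge back to original df
df = pd.merge(df, external_ids_df, on="person_id", how="left")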

# Save final result
df.to_csv("sample_people.csv", index=False)
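
As an example of the social-presence analysis mentioned above, the sketch below counts how many social platforms a person has an identifier for. The name `social_presence` and the sample record are hypothetical; the field names are the social columns appended in this step.

```python
# Social-media columns appended from the person external IDs endpoint.
SOCIAL_FIELDS = ("facebook_id", "instagram_id", "tiktok_id", "twitter_id", "youtube_id")


def social_presence(record: dict) -> int:
    """Count the social platforms for which a non-empty identifier exists."""
    # Missing values may be None, NaN, or an empty string, so count only
    # non-empty strings.
    return sum(
        1 for field in SOCIAL_FIELDS
        if isinstance(record.get(field), str) and record[field]
    )


# Hypothetical record; the handles are placeholders, not real accounts.
person = {
    "imdb_id": "nm0000000",
    "facebook_id": None,
    "instagram_id": "example_handle",
    "tiktok_id": "",
    "twitter_id": "example_handle",
    "youtube_id": None,
}
presence = social_presence(person)
```

On the DataFrame, the same count could be computed row by row with `df.apply(lambda row: social_presence(row.to_dict()), axis=1)`.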