# Comprehensive Data Merge: Final Dataset with OMDB

This notebook merges the final cleaned dataset (SOVAI + TMDB) with OMDB ratings data.

**Key improvements over previous merge:**
- Uses LEFT JOIN to preserve ALL movies from final_df.csv (not just those with OMDB ratings)
- Properly handles duplicate columns from merges
- Cleans up redundant/empty columns
- Provides detailed merge statistics


In [38]:
import pandas as pd
import glob
import os
import csv
from pathlib import Path


## 1. Load Final Dataset (SOVAI + TMDB merged)


In [39]:
CLEAN_DATA_PATH = "../data/cleaned"
OMDB_DATA_PATH = "../omdb_api"

# Load final_df (already has SOVAI + TMDB merged and filtered)
final_df = pd.read_csv(f'{CLEAN_DATA_PATH}/final_df.csv')
print(f"Loaded {len(final_df)} rows, {len(final_df.columns)} columns")
print(f"Movies with imdb_id: {final_df['imdb_id'].notna().sum()}")
print(f"Movies without imdb_id: {final_df['imdb_id'].isna().sum()}")
final_df.head()


Loaded 4274 rows, 51 columns
Movies with imdb_id: 4240
Movies without imdb_id: 34


Unnamed: 0,ticker,date,title,distributor,gross,percent_yd,percent_lw,theaters,per_theater,total_gross,...,vote_average,vote_count,origin_country,spoken_languages,genre_ids,genre_names,production_company_ids,production_company_names,belongs_to_collection,gross_per_theater
0,PARA,2016-06-02,10 Cloverfield Lane,Paramount Pi…,11414,0.32,-0.12,120.0,95.0,72082999,...,7.0,8351.0,US,English,"53, 878, 18, 27","Thriller, Science Fiction, Drama, Horror",11461,Bad Robot,,600691.658333
1,Private,2006-09-04,10th & Wolf,ThinkFilm,1791,0.0,0.0,6.0,299.0,49783,...,5.856,108.0,US,English,"28, 80, 18, 9648, 53","Action, Crime, Drama, Mystery, Thriller",41427,Suzanne DeLaurentiis Productions,,8297.166667
2,DIS,2009-05-25,12 Rounds,20th Century…,4832,0.0,0.98,29.0,167.0,12187944,...,5.904,819.0,US,English,"28, 53, 80","Action, Thriller, Crime","1557, 17887, 2890, 10339","The Mark Gordon Company, Midnight Sun Pictures...",12 Rounds Collection,420273.931034
3,WBD,2018-03-29,12 Strong,Warner Bros.,4502,0.08,-0.45,95.0,47.0,45500164,...,6.3,3093.0,US,English,"10752, 18, 28, 36","War, Drama, Action, History","79529, 1088, 33681, 130, 101829","Torridon Films, Alcon Entertainment, Black Lab...",,478949.094737
4,SONY,2004-06-03,13 Going On 30,Sony Pictures,115000,0.01,-0.59,1164.0,99.0,54901000,...,6.8,5385.0,US,English,"35, 14, 10749","Comedy, Fantasy, Romance","497, 19636","Revolution Studios, Thirteen Productions",,47165.80756


## 2. Load and Combine OMDB Batch Files


In [40]:
# Find all OMDB batch files
csv_files = sorted(glob.glob(f"{OMDB_DATA_PATH}/omdbmovies_batch_*.csv"))
print(f"Found {len(csv_files)} OMDB batch files")

# Load and combine all batches
dfs = []
for file in csv_files:
    df = pd.read_csv(file)
    # Remove rows where Title is null or empty
    df = df[df["Title"].notna()]
    df = df[df["Title"].str.strip() != ""]
    dfs.append(df)
    print(f"  Loaded {os.path.basename(file)}: {len(df)} rows")

# Combine all batches
omdb_merged = pd.concat(dfs, ignore_index=True)
print(f"\nTotal OMDB records before deduplication: {len(omdb_merged)}")

# Remove duplicates based on imdbID (keep first occurrence)
initial_count = len(omdb_merged)
omdb_merged = omdb_merged.drop_duplicates(subset=["imdbID"], keep="first")
duplicates_removed = initial_count - len(omdb_merged)
if duplicates_removed > 0:
    print(f"Removed {duplicates_removed} duplicate entries")

print(f"Total OMDB records: {len(omdb_merged)}")
print(f"Unique IMDb IDs: {omdb_merged['imdbID'].nunique()}")


Found 11 OMDB batch files
  Loaded omdbmovies_batch_0.csv: 814 rows
  Loaded omdbmovies_batch_1.csv: 833 rows
  Loaded omdbmovies_batch_10.csv: 59 rows
  Loaded omdbmovies_batch_2.csv: 861 rows
  Loaded omdbmovies_batch_3.csv: 824 rows
  Loaded omdbmovies_batch_4.csv: 848 rows
  Loaded omdbmovies_batch_5.csv: 846 rows
  Loaded omdbmovies_batch_6.csv: 811 rows
  Loaded omdbmovies_batch_7.csv: 860 rows
  Loaded omdbmovies_batch_8.csv: 848 rows
  Loaded omdbmovies_batch_9.csv: 877 rows

Total OMDB records before deduplication: 8481
Total OMDB records: 8481
Unique IMDb IDs: 8481
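
The listing above shows `omdbmovies_batch_10.csv` loaded before `omdbmovies_batch_2.csv`, because `sorted()` compares file names as strings. Load order only matters when it affects results, for example with `drop_duplicates(keep="first")`. A minimal sketch of sorting by the numeric suffix instead; `batch_number` is a hypothetical helper, not part of this notebook:

```python
import re

def batch_number(path):
    """Return the integer suffix of an omdbmovies_batch_<n>.csv path (hypothetical helper)."""
    match = re.search(r"_batch_(\d+)\.csv$", path)
    return int(match.group(1)) if match else -1

# File names as they appear in the listing above
paths = [
    "../omdb_api/omdbmovies_batch_0.csv",
    "../omdb_api/omdbmovies_batch_10.csv",
    "../omdb_api/omdbmovies_batch_2.csv",
]

# Sort by the parsed batch number rather than by the raw string
ordered = sorted(paths, key=batch_number)
print(ordered)
```

Passing `key=batch_number` to the existing `sorted(glob.glob(...))` call would apply the same ordering.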


## 3. Clean OMDB Data


In [41]:
# Rename imdbID to imdb_id for consistency
omdb_merged = omdb_merged.rename(columns={"imdbID": "imdb_id"})

# Filter to movies released on or after 1990-01-01 (matching final_df filter)
omdb_merged["omdb_release_date"] = pd.to_datetime(omdb_merged["Released"], errors="coerce")
omdb_merged = omdb_merged[omdb_merged["omdb_release_date"] >= pd.Timestamp("1990-01-01")]
print(f"After filtering to post-1990 releases: {len(omdb_merged)} rows")

# Select relevant columns (exclude redundant ones like Type, Season, Episode, etc.)
columns_to_keep = [
    "imdb_id",
    "Title",
    "Year",
    "Rated",
    "Released",
    "Runtime",
    "Genre",
    "Director",
    "Writer",
    "Actors",
    "Plot",
    "Language",
    "Country",
    "Awards",
    "Poster",
    "Metascore",
    "imdbRating",
    "imdbVotes",
    "BoxOffice",
    "Production",
    "Rating_InternetMovieDatabase",
    "Rating_RottenTomatoes",
    "Rating_Metacritic",
]

# Only keep columns that exist in the dataframe
available_columns = [col for col in columns_to_keep if col in omdb_merged.columns]
omdb_cleaned = omdb_merged[available_columns].copy()

# Add prefix to OMDB columns to avoid conflicts (except imdb_id which is the merge key)
omdb_columns = {col: f"omdb_{col.lower()}" if col != "imdb_id" else col 
                for col in omdb_cleaned.columns}
omdb_cleaned = omdb_cleaned.rename(columns=omdb_columns)

print(f"Final OMDB data: {len(omdb_cleaned)} rows, {len(omdb_cleaned.columns)} columns")
omdb_cleaned.head()


After filtering to post-1990 releases: 6871 rows
Final OMDB data: 6871 rows, 23 columns


Unnamed: 0,imdb_id,omdb_title,omdb_year,omdb_rated,omdb_released,omdb_runtime,omdb_genre,omdb_director,omdb_writer,omdb_actors,...,omdb_awards,omdb_poster,omdb_metascore,omdb_imdbrating,omdb_imdbvotes,omdb_boxoffice,omdb_production,omdb_rating_internetmoviedatabase,omdb_rating_rottentomatoes,omdb_rating_metacritic
1,tt9362736,Die My Love,2025,R,07 Nov 2025,119 min,"Drama, Thriller",Lynne Ramsay,"Enda Walsh, Lynne Ramsay, Alice Birch","Jennifer Lawrence, Robert Pattinson, Sissy Spacek",...,10 nominations total,https://m.media-amazon.com/images/M/MV5BYjc5OW...,72.0,6.6,9529.0,"$4,884,888",,6.6/10,,72/100
2,tt29567915,Nuremberg,2025,PG-13,07 Nov 2025,148 min,"Drama, History, Thriller",James Vanderbilt,"Jack El-Hai, James Vanderbilt","Rami Malek, Russell Crowe, Richard E. Grant",...,1 win & 4 nominations total,https://m.media-amazon.com/images/M/MV5BMjZhNG...,,,,,,,67%,
3,tt31227572,Predator: Badlands,2025,PG-13,07 Nov 2025,107 min,"Action, Adventure, Sci-Fi",Dan Trachtenberg,"Patrick Aison, Jim Thomas, John Thomas","Elle Fanning, Dimitrius Schuster-Koloamatangi",...,,https://m.media-amazon.com/images/M/MV5BNTdjZG...,71.0,7.6,18100.0,"$40,000,000",,7.6/10,85%,71/100
4,tt12583926,Anniversary,2025,R,29 Oct 2025,,Thriller,Jan Komasa,"Lori Rosene-Gambino, Jan Komasa","Diane Lane, Kyle Chandler, Zoey Deutch",...,,,,,,,,,62%,
5,tt14661372,Anniversary,2021,,26 Aug 2021,7 min,"Short, Horror",Craig Ouellette,Craig Ouellette,"David Crane, David T. Crane, Katie Peabody",...,1 win,https://m.media-amazon.com/images/M/MV5BZjQ2Yj...,,,,,,,,
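
Several OMDB fields in the preview above are strings with units attached (`119 min`, `85%`, `72/100`). If numeric versions are needed later, they could be derived roughly as follows. This is a sketch on toy rows; the output column names `runtime_min`, `rt_pct`, and `metacritic` are assumptions for illustration only:

```python
import pandas as pd

# Toy rows imitating the OMDB string formats shown in the preview
sample = pd.DataFrame({
    "omdb_runtime": ["119 min", None, "7 min"],
    "omdb_rating_rottentomatoes": ["85%", "67%", None],
    "omdb_rating_metacritic": ["72/100", None, "71/100"],
})

# Extract the leading digits from the runtime; missing values stay NaN
sample["runtime_min"] = pd.to_numeric(
    sample["omdb_runtime"].str.extract(r"(\d+)", expand=False), errors="coerce"
)
# Strip the trailing percent sign
sample["rt_pct"] = pd.to_numeric(
    sample["omdb_rating_rottentomatoes"].str.rstrip("%"), errors="coerce"
)
# Keep the part before the slash
sample["metacritic"] = pd.to_numeric(
    sample["omdb_rating_metacritic"].str.split("/").str[0], errors="coerce"
)
print(sample[["runtime_min", "rt_pct", "metacritic"]])
```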


## 4. Merge Final Dataset with OMDB Data

**Important:** We use LEFT JOIN to preserve ALL movies from final_df, even if they don't have OMDB data.
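
A left join keeps the row count of the left table only when the join key is unique on the right side. The sketch below, on toy data, shows two optional `DataFrame.merge` arguments that make this checkable: `validate="m:1"` raises if the right table has duplicate keys, and `indicator=True` records which rows found a match. The toy frames are assumptions, not the notebook's data.

```python
import pandas as pd

left = pd.DataFrame({"imdb_id": ["tt1", "tt2", None], "title": ["A", "B", "C"]})
right = pd.DataFrame({"imdb_id": ["tt1", "tt9"], "omdb_title": ["A (OMDB)", "Z"]})

merged = left.merge(
    right,
    on="imdb_id",
    how="left",        # keep every row of the left table
    validate="m:1",    # raise MergeError if imdb_id is duplicated on the right
    indicator=True,    # add a _merge column: "both" or "left_only"
)

print(len(merged) == len(left))
print(merged["_merge"].value_counts())
```

Rows marked `left_only` are the ones that end up with missing OMDB columns.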


In [42]:
# Left join to preserve all movies from final_df
final_merged = final_df.merge(
    omdb_cleaned,
    on="imdb_id",
    how="left",  # Keep all movies from final_df
    suffixes=("", "_omdb")
)

print(f"Merge complete: {len(final_merged)} rows, {len(final_merged.columns)} columns")
if 'omdb_title' in final_merged.columns:
    print(f"Movies with OMDB data: {final_merged['omdb_title'].notna().sum()}")
    print(f"Movies without OMDB data: {final_merged['omdb_title'].isna().sum()}")
    print(f"Percentage with OMDB data: {(final_merged['omdb_title'].notna().sum() / len(final_merged) * 100):.1f}%")
else:
    print("Warning: OMDB data columns not found in merged dataset")


Merge complete: 4274 rows, 73 columns
Movies with OMDB data: 3948
Movies without OMDB data: 326
Percentage with OMDB data: 92.4%


In [43]:
# Filter to movies from 1990 to October 2025 (excluding last 30 days)
from datetime import datetime, timedelta

# Convert release_date to datetime if not already
final_merged['release_date'] = pd.to_datetime(final_merged['release_date'], errors='coerce')

# Set date range: 1990-01-01 to 2025-10-31
start_date = pd.Timestamp('1990-01-01')
end_date = pd.Timestamp('2025-10-31')

# Also exclude movies from last 30 days (if any are after Oct 31)
today = datetime.now()
cutoff_date = today - timedelta(days=30)
cutoff_timestamp = pd.Timestamp(cutoff_date)

# Use the earlier of end_date or cutoff_date
effective_end_date = min(end_date, cutoff_timestamp)

print("\n" + "=" * 80)
print("FILTERING BY DATE RANGE (1990 to October 2025)")
print("=" * 80)
print(f"Initial rows after merge: {len(final_merged)}")
print(f"Filtering to: {start_date.strftime('%Y-%m-%d')} to {effective_end_date.strftime('%Y-%m-%d')}")

# Filter by release date (rows whose release_date is NaT fail both comparisons and are dropped)
before_filter = len(final_merged)
final_merged = final_merged[
    (final_merged['release_date'] >= start_date) & 
    (final_merged['release_date'] <= effective_end_date)
].copy()

filtered_out = before_filter - len(final_merged)
print(f"Rows after date filtering: {len(final_merged)}")
print(f"Rows filtered out: {filtered_out} ({(filtered_out/before_filter*100):.1f}%)")
if len(final_merged) > 0:
    print(f"Date range in dataset: {final_merged['release_date'].min()} to {final_merged['release_date'].max()}")



FILTERING BY DATE RANGE (1990 to October 2025)
Initial rows after merge: 4274
Filtering to: 1990-01-01 to 2025-10-30
Rows after date filtering: 4245
Rows filtered out: 29 (0.7%)
Date range in dataset: 1993-02-11 00:00:00 to 2025-10-28 00:00:00
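
The count of filtered rows above does not distinguish out-of-range dates from dates that failed to parse, since comparisons against `NaT` are always false. A small sketch on toy values (not the notebook's data) of how the two cases behave:

```python
import pandas as pd

# One valid date, one unparseable string, one date past the end of the range
dates = pd.to_datetime(
    pd.Series(["1995-06-01", "not a date", "2030-01-01"]), errors="coerce"
)
start, end = pd.Timestamp("1990-01-01"), pd.Timestamp("2025-10-31")

# between() is inclusive on both ends by default and returns False for NaT
in_range = dates.between(start, end)
print(in_range.tolist())        # [True, False, False]
print(int(dates.isna().sum()))  # 1
```

Reporting `release_date.isna().sum()` before filtering would show how many rows are dropped for lack of a date.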


## 5. Clean Up Duplicate/Redundant Columns


In [44]:
initial_cols = len(final_merged.columns)
columns_to_drop = []

# Check for duplicate date columns (only present if an earlier merge used the default _x/_y suffixes)
if "date_x" in final_merged.columns and "date_y" in final_merged.columns:
    # Keep date_x (from final_df), rename it to date, and drop date_y
    columns_to_drop.append("date_y")
    final_merged = final_merged.rename(columns={"date_x": "date"})

# Drop columns with all null values
null_cols = final_merged.columns[final_merged.isnull().all()].tolist()
columns_to_drop.extend(null_cols)

if columns_to_drop:
    final_merged = final_merged.drop(columns=columns_to_drop)
    print(f"Dropped {len(columns_to_drop)} redundant/empty columns")

print(f"Final columns: {len(final_merged.columns)} (reduced from {initial_cols})")

# Remove duplicate movies - intelligently aggregate values when merging duplicates
print("\n" + "=" * 80)
print("REMOVING DUPLICATE MOVIES WITH VALUE AGGREGATION")
print("=" * 80)
initial_rows = len(final_merged)
print(f"Initial rows: {initial_rows}")

# Create a unique movie identifier: use title_key if present, otherwise fall back to tmdb_id, then imdb_id
# title_key is normalized, so it should match the same movie even if IDs differ
if 'title_key' in final_merged.columns:
    # Rows that share a title_key are grouped together even when their IDs differ;
    # this handles cases where the same movie appears with different metadata
    final_merged['_movie_id'] = final_merged['title_key'].copy()
else:
    # Fallback to IDs if title_key doesn't exist
    if 'imdb_id' in final_merged.columns and 'tmdb_id' in final_merged.columns:
        final_merged['_movie_id'] = final_merged['tmdb_id'].fillna(final_merged['imdb_id'])
    elif 'tmdb_id' in final_merged.columns:
        final_merged['_movie_id'] = final_merged['tmdb_id']
    elif 'imdb_id' in final_merged.columns:
        final_merged['_movie_id'] = final_merged['imdb_id']
    else:
        # Last resort: use title, or a per-row counter aligned to the current (filtered) index
        final_merged['_movie_id'] = final_merged.get(
            'title', pd.Series(range(len(final_merged)), index=final_merged.index))

# Check for duplicates before aggregation
print("\nChecking for duplicates...")
print(f"Total rows: {len(final_merged)}")
print(f"Unique _movie_id values: {final_merged['_movie_id'].nunique()}")
print(f"Null _movie_id values: {final_merged['_movie_id'].isna().sum()}")

# Check duplicates by _movie_id
duplicate_counts = final_merged['_movie_id'].value_counts()
duplicates = duplicate_counts[duplicate_counts > 1]
print(f"\nMovies with duplicates (by _movie_id): {len(duplicates)}")
print(f"Total duplicate rows: {duplicates.sum() - len(duplicates) if len(duplicates) > 0 else 0}")

# Also check for duplicates by title (in case IDs are missing)
title_dup = pd.Series(dtype=int)
if 'title' in final_merged.columns:
    title_duplicates = final_merged['title'].value_counts()
    title_dup = title_duplicates[title_duplicates > 1]
    print(f"Movies with duplicate titles: {len(title_dup)}")
    if len(title_dup) > 0:
        print(f"Total rows with duplicate titles: {title_dup.sum() - len(title_dup)}")

if len(duplicates) > 0:
    print(f"\nTop 10 movies with most duplicates:")
    for movie_id, count in duplicates.head(10).items():
        movie_rows = final_merged[final_merged['_movie_id'] == movie_id]
        title = movie_rows['title'].iloc[0] if 'title' in movie_rows.columns else 'N/A'
        print(f"  {movie_id}: {count} rows - {title}")
elif len(title_dup) > 0:
    print(f"\nNote: Found {len(title_dup)} movies with duplicate titles but unique _movie_id")
    print("This might indicate the same movie with different IDs. Consider using title_key for deduplication.")

# Define aggregation strategies for different column types
def aggregate_duplicates(group):
    """Aggregate duplicate rows intelligently - returns a Series."""
    if len(group) == 1:
        return group.iloc[0]
    
    result = group.iloc[0].copy()
    
    # For each column, decide how to aggregate
    for col in group.columns:
        if col == '_movie_id':
            continue  # Skip the grouping column
        
        values = group[col].dropna()
        
        if len(values) == 0:
            # All NaN, keep NaN
            result[col] = None
        elif len(values) == 1:
            # Only one non-null value, use it
            result[col] = values.iloc[0]
        else:
            # Multiple non-null values - need to decide
            if col in ['gross', 'total_gross', 'revenue', 'budget', 'theaters', 
                      'vote_count', 'omdb_imdbvotes']:
                # Numeric columns: take maximum (most complete/highest value)
                numeric_vals = pd.to_numeric(values, errors='coerce')
                result[col] = numeric_vals.max()
            elif col in ['omdb_metascore']:
                # Metascore: take maximum (best rating)
                numeric_vals = pd.to_numeric(values, errors='coerce')
                result[col] = numeric_vals.max()
            elif col in ['per_theater', 'gross_per_theater', 'vote_average', 
                        'popularity', 'omdb_imdbrating']:
                # Average/rate columns: take maximum (best performance)
                numeric_vals = pd.to_numeric(values, errors='coerce')
                result[col] = numeric_vals.max()
            elif col in ['date', 'release_date', 'omdb_released']:
                # Date columns: take most recent
                dates = pd.to_datetime(values, errors='coerce')
                result[col] = dates.max()
            elif col in ['days_in_release']:
                # Days in release: take maximum (most complete run)
                numeric_vals = pd.to_numeric(values, errors='coerce')
                result[col] = numeric_vals.max()
            elif col in ['title', 'distributor', 'overview', 'omdb_title', 
                        'omdb_plot', 'omdb_director', 'omdb_writer', 'omdb_actors',
                        'genre_names', 'production_company_names']:
                # Text columns: take longest/non-empty (most complete)
                text_values = values.astype(str)
                longest_idx = text_values.str.len().idxmax()
                result[col] = text_values.loc[longest_idx]
            elif col in ['omdb_rating_rottentomatoes', 'omdb_rating_metacritic', 
                        'omdb_rating_internetmoviedatabase']:
                # Rating strings: take first non-null (they should be same)
                result[col] = values.iloc[0]
            else:
                # Default: take first non-null value
                result[col] = values.iloc[0]
    
    return result

# Group by movie_id and aggregate (groupby's default dropna=True drops rows with a null _movie_id)
print("\nAggregating duplicate movies...")
if len(duplicates) > 0:
    final_merged = final_merged.groupby('_movie_id', group_keys=False).apply(aggregate_duplicates).reset_index(drop=True)
else:
    print("No duplicates found - skipping aggregation")

# Remove temporary column
final_merged = final_merged.drop(columns=['_movie_id'], errors='ignore')

duplicates_removed = initial_rows - len(final_merged)
print(f"\nDuplicates removed: {duplicates_removed} rows")
print(f"Final unique movies: {len(final_merged)} rows")
print(f"Reduction: {(duplicates_removed/initial_rows*100):.1f}%")
print(f"\nAggregation strategy:")
print("  - Numeric (gross, revenue, theaters, etc.): Maximum value")
print("  - Dates: Most recent")
print("  - Text (title, overview, etc.): Longest/most complete")
print("  - Ratings: First non-null value")


Final columns: 73 (reduced from 73)

REMOVING DUPLICATE MOVIES WITH VALUE AGGREGATION
Initial rows: 4245

Checking for duplicates...
Total rows: 4245
Unique _movie_id values: 4245
Null _movie_id values: 0

Movies with duplicates (by _movie_id): 0
Total duplicate rows: 0
Movies with duplicate titles: 0

Aggregating duplicate movies...
No duplicates found - skipping aggregation

Duplicates removed: 0 rows
Final unique movies: 4245 rows
Reduction: 0.0%

Aggregation strategy:
  - Numeric (gross, revenue, theaters, etc.): Maximum value
  - Dates: Most recent
  - Text (title, overview, etc.): Longest/most complete
  - Ratings: First non-null value
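
For comparison, some of the per-column rules listed above can also be written with `groupby().agg()` and a dictionary of rules. This is an alternative sketch on toy data, not the notebook's method, and it covers only the maximum and first-non-null rules (the "longest text" rule would need a custom function):

```python
import pandas as pd

toy = pd.DataFrame({
    "title_key": ["movie a", "movie a", "movie b"],
    "total_gross": [100.0, 250.0, 40.0],
    "release_date": pd.to_datetime(["2001-05-25", None, "2005-12-18"]),
    "omdb_rated": [None, "R", "PG"],
})

rules = {
    "total_gross": "max",    # numeric: largest value
    "release_date": "max",   # dates: most recent, NaT is skipped
    "omdb_rated": "first",   # default: first non-null value
}

# One output row per title_key
deduped = toy.groupby("title_key", as_index=False).agg(rules)
print(deduped)
```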


## 6. Dataset Summary and Missing Value Analysis


In [45]:
print("Final Dataset Summary:")
print(f"Total rows: {len(final_merged)}")
print(f"Total columns: {len(final_merged.columns)}")
print("\nMissing values per column (top 15):")
missing_counts = final_merged.isnull().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
for col, count in missing_counts.head(15).items():
    pct = (count / len(final_merged)) * 100
    print(f"  {col}: {count} ({pct:.1f}%)")


Final Dataset Summary:
Total rows: 4245
Total columns: 73

Missing values per column (top 15):
  omdb_production: 4230 (99.6%)
  belongs_to_collection: 3536 (83.3%)
  omdb_awards: 936 (22.0%)
  omdb_metascore: 806 (19.0%)
  omdb_rating_metacritic: 805 (19.0%)
  omdb_rating_rottentomatoes: 699 (16.5%)
  omdb_boxoffice: 594 (14.0%)
  omdb_rated: 499 (11.8%)
  omdb_writer: 405 (9.5%)
  omdb_rating_internetmoviedatabase: 331 (7.8%)
  omdb_imdbrating: 331 (7.8%)
  omdb_language: 318 (7.5%)
  omdb_imdbvotes: 317 (7.5%)
  omdb_actors: 317 (7.5%)
  omdb_poster: 312 (7.3%)


## 7. Adjust Monetary Values for Inflation (Present Value)

Convert all dollar amounts to present value (2024 dollars) using Consumer Price Index (CPI) to ensure accurate comparisons across different years.
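
The adjustment multiplies each nominal amount by the ratio of the target-year CPI to the release-year CPI. A short worked sketch, using three CPI values from the table in the cell below; `to_target_dollars` is a hypothetical helper for illustration:

```python
# Annual CPI values taken from the CPI_DATA table used in this notebook
cpi = {2004: 188.9, 2016: 240.0, 2024: 313.0}

def to_target_dollars(amount, year, target_year=2024):
    """Scale a nominal amount by CPI[target_year] / CPI[year] (hypothetical helper)."""
    return amount * cpi[target_year] / cpi[year]

# Inflation factor for a 2016 release
factor_2016 = cpi[2024] / cpi[2016]
print(round(factor_2016, 6))  # 1.304167

# A 2016 total gross of $72,082,999 expressed in 2024 dollars
print(to_target_dollars(72_082_999, 2016))
```

The factor of 1.304167 matches the `inflation_factor` shown for the 2016 release in the preview after the adjustment cell.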


In [46]:
# Adjust monetary values for inflation to present value (2024 dollars)
# Using CPI data from US Bureau of Labor Statistics

# Check if final_merged exists (must run previous cells first)
if 'final_merged' not in globals():
    raise NameError(
        "ERROR: 'final_merged' is not defined. "
        "Please run the previous cells in order:\n"
        "  1. Cell 1: Load final_df\n"
        "  2. Cell 2-3: Load and clean OMDB data\n"
        "  3. Cell 4: Merge datasets\n"
        "  4. Cell 5: Filter by date range\n"
        "  5. Cell 6: Clean up and deduplicate\n"
        "Then run this cell (inflation adjustment)."
    )

# Target year for present value (use most recent year in dataset or 2024)
TARGET_YEAR = 2024

# CPI values (base year 1982-84 = 100) - approximate values for key years
# For production use, consider using the 'cpi' library: pip install cpi
# Or fetch from BLS API: https://www.bls.gov/developers/api_signature.htm
CPI_DATA = {
    1990: 130.7, 1991: 136.2, 1992: 140.3, 1993: 144.5, 1994: 148.2,
    1995: 152.4, 1996: 156.9, 1997: 160.5, 1998: 163.0, 1999: 166.6,
    2000: 172.2, 2001: 177.1, 2002: 179.9, 2003: 184.0, 2004: 188.9,
    2005: 195.3, 2006: 201.6, 2007: 207.3, 2008: 215.3, 2009: 214.5,
    2010: 218.1, 2011: 224.9, 2012: 229.6, 2013: 233.0, 2014: 236.7,
    2015: 237.0, 2016: 240.0, 2017: 245.1, 2018: 251.1, 2019: 255.7,
    2020: 258.8, 2021: 270.9, 2022: 292.7, 2023: 304.7, 2024: 313.0  # Approximate
}

# Get CPI for target year
cpi_target = CPI_DATA.get(TARGET_YEAR, 313.0)

print("=" * 80)
print("ADJUSTING MONETARY VALUES FOR INFLATION")
print("=" * 80)
print(f"Target year (present value): {TARGET_YEAR}")
print(f"CPI for {TARGET_YEAR}: {cpi_target}")
if 'release_date' in final_merged.columns:
    print(f"\nDate range in dataset: {final_merged['release_date'].min()} to {final_merged['release_date'].max()}")
else:
    print("\n⚠ Warning: 'release_date' column not found in final_merged")

# Extract year from release_date for CPI lookup
final_merged['release_year'] = pd.to_datetime(final_merged['release_date'], errors='coerce').dt.year

# Function to get CPI for a given year (interpolate if missing)
def get_cpi(year):
    if pd.isna(year) or year < 1990:
        return CPI_DATA.get(1990, 130.7)  # Use 1990 CPI as fallback
    if year > 2024:
        return CPI_DATA.get(2024, 313.0)  # Use 2024 CPI for future years
    # Use exact match if available, otherwise interpolate
    if year in CPI_DATA:
        return CPI_DATA[year]
    # Interpolate between nearest years
    year_int = int(year)
    if year_int < min(CPI_DATA.keys()):
        return CPI_DATA[min(CPI_DATA.keys())]
    if year_int > max(CPI_DATA.keys()):
        return CPI_DATA[max(CPI_DATA.keys())]
    # Find nearest years
    lower_year = max([y for y in CPI_DATA.keys() if y <= year_int], default=min(CPI_DATA.keys()))
    upper_year = min([y for y in CPI_DATA.keys() if y >= year_int], default=max(CPI_DATA.keys()))
    if lower_year == upper_year:
        return CPI_DATA[lower_year]
    # Linear interpolation
    cpi_lower = CPI_DATA[lower_year]
    cpi_upper = CPI_DATA[upper_year]
    return cpi_lower + (cpi_upper - cpi_lower) * (year_int - lower_year) / (upper_year - lower_year)

# Apply CPI to get inflation factor
final_merged['cpi_release_year'] = final_merged['release_year'].apply(get_cpi)
final_merged['inflation_factor'] = cpi_target / final_merged['cpi_release_year']

print(f"\nInflation factors:")
print(f"  Min: {final_merged['inflation_factor'].min():.3f} (newest movies, ~1.0x adjustment)")
print(f"  Max: {final_merged['inflation_factor'].max():.3f} (oldest movies, largest adjustment)")
print(f"  Mean: {final_merged['inflation_factor'].mean():.3f}")

# Monetary columns to adjust
monetary_columns = ['gross', 'total_gross', 'per_theater', 'gross_per_theater', 'budget', 'revenue']

# Adjust each monetary column
for col in monetary_columns:
    if col in final_merged.columns:
        # Convert to numeric, handling any string values
        final_merged[col] = pd.to_numeric(final_merged[col], errors='coerce')
        # Create adjusted column
        adjusted_col = f"{col}_adjusted_{TARGET_YEAR}"
        final_merged[adjusted_col] = final_merged[col] * final_merged['inflation_factor']
        print(f"\n✓ Adjusted {col} -> {adjusted_col}")
        print(f"  Original range: ${final_merged[col].min():,.0f} to ${final_merged[col].max():,.0f}")
        print(f"  Adjusted range: ${final_merged[adjusted_col].min():,.0f} to ${final_merged[adjusted_col].max():,.0f}")

# Handle omdb_boxoffice (may be in string format with $ and commas)
if 'omdb_boxoffice' in final_merged.columns:
    # Clean and convert omdb_boxoffice
    def clean_currency(value):
        if pd.isna(value) or value == 'None' or value == 'N/A':
            return None
        if isinstance(value, str):
            # Remove $, commas, and spaces
            cleaned = value.replace('$', '').replace(',', '').replace(' ', '')
            try:
                return float(cleaned)
            except ValueError:
                return None
        return float(value) if pd.notna(value) else None
    
    final_merged['omdb_boxoffice_clean'] = final_merged['omdb_boxoffice'].apply(clean_currency)
    final_merged['omdb_boxoffice_clean'] = pd.to_numeric(final_merged['omdb_boxoffice_clean'], errors='coerce')
    
    # Adjust omdb_boxoffice (use release_date year for CPI)
    adjusted_col = f"omdb_boxoffice_adjusted_{TARGET_YEAR}"
    final_merged[adjusted_col] = final_merged['omdb_boxoffice_clean'] * final_merged['inflation_factor']
    print(f"\n✓ Adjusted omdb_boxoffice -> {adjusted_col}")
    print(f"  Original range: ${final_merged['omdb_boxoffice_clean'].min():,.0f} to ${final_merged['omdb_boxoffice_clean'].max():,.0f}")
    print(f"  Adjusted range: ${final_merged[adjusted_col].min():,.0f} to ${final_merged[adjusted_col].max():,.0f}")

print(f"\n" + "=" * 80)
print("Inflation adjustment complete!")
print(f"All monetary values are now in {TARGET_YEAR} dollars (present value)")
print("=" * 80)


ADJUSTING MONETARY VALUES FOR INFLATION
Target year (present value): 2024
CPI for 2024: 313.0

Date range in dataset: 1993-02-11 00:00:00 to 2025-10-28 00:00:00

Inflation factors:
  Min: 1.000 (newest movies, ~1.0x adjustment)
  Max: 2.166 (oldest movies, largest adjustment)
  Mean: 1.427

✓ Adjusted gross -> gross_adjusted_2024
  Original range: $0 to $12,150,000
  Adjusted range: $0 to $14,038,206

✓ Adjusted total_gross -> total_gross_adjusted_2024
  Original range: $400 to $804,617,772
  Adjusted range: $593 to $1,285,862,395

✓ Adjusted per_theater -> per_theater_adjusted_2024
  Original range: $0 to $60,935
  Adjusted range: $0 to $75,956

✓ Adjusted gross_per_theater -> gross_per_theater_adjusted_2024
  Original range: $60 to $inf
  Adjusted range: $87 to $inf

✓ Adjusted budget -> budget_adjusted_2024
  Original range: $0 to $350,000,000
  Adjusted range: $0 to $390,031,153

✓ Adjusted revenue -> revenue_adjusted_2024
  Original range: $0 to $2,923,706,026
  Adjusted ran

In [47]:
final_merged.head()

Unnamed: 0,ticker,date,title,distributor,gross,percent_yd,percent_lw,theaters,per_theater,total_gross,...,cpi_release_year,inflation_factor,gross_adjusted_2024,total_gross_adjusted_2024,per_theater_adjusted_2024,gross_per_theater_adjusted_2024,budget_adjusted_2024,revenue_adjusted_2024,omdb_boxoffice_clean,omdb_boxoffice_adjusted_2024
0,PARA,2016-06-02,10 Cloverfield Lane,Paramount Pi…,11414,0.32,-0.12,120.0,95.0,72082999,...,240.0,1.304167,14885.758333,94008240.0,123.895833,783402.037743,19562500.0,143741300.0,72082998.0,94008240.0
1,Private,2006-09-04,10th & Wolf,ThinkFilm,1791,0.0,0.0,6.0,299.0,49783,...,201.6,1.552579,2780.669643,77292.06,464.22123,12882.009755,12420630.0,222719.1,54702.0,84929.2
2,DIS,2009-05-25,12 Rounds,20th Century…,4832,0.0,0.98,29.0,167.0,12187944,...,214.5,1.459207,7050.890443,17784740.0,243.687646,613266.855076,32102560.0,25215580.0,12234694.0,17852960.0
3,WBD,2018-03-29,12 Strong,Warner Bros.,4502,0.08,-0.45,95.0,47.0,45500164,...,251.1,1.246515,5611.812027,56716650.0,58.586221,597017.390094,43628040.0,84078480.0,45819713.0,57114970.0
4,SONY,2004-06-03,13 Going On 30,Sony Pictures,115000,0.01,-0.59,1164.0,99.0,54901000,...,188.9,1.656961,190550.55585,90968840.0,164.039174,78151.920415,61307570.0,159896800.0,57231747.0,94830790.0


## 8. Add Weekly Gross Columns

In [None]:
def weeks_in_release(row):
    """Whole weeks between release_date and date, capped at 12."""
    delta_weeks = (pd.to_datetime(row['date']) - pd.to_datetime(row['release_date'])).days // 7
    return min(delta_weeks, 12)

# Note: a date less than 7 days after release_date yields 0 weeks
final_merged['weeks_in_release'] = final_merged.apply(weeks_in_release, axis=1)
final_merged.head()

Unnamed: 0,ticker,date,title,distributor,gross,percent_yd,percent_lw,theaters,per_theater,total_gross,...,inflation_factor,gross_adjusted_2024,total_gross_adjusted_2024,per_theater_adjusted_2024,gross_per_theater_adjusted_2024,budget_adjusted_2024,revenue_adjusted_2024,omdb_boxoffice_clean,omdb_boxoffice_adjusted_2024,weeks_in_release
0,PARA,2016-06-02,10 Cloverfield Lane,Paramount Pi…,11414,0.32,-0.12,120.0,95.0,72082999,...,1.304167,14885.758333,94008240.0,123.895833,783402.037743,19562500.0,143741300.0,72082998.0,94008240.0,12
1,Private,2006-09-04,10th & Wolf,ThinkFilm,1791,0.0,0.0,6.0,299.0,49783,...,1.552579,2780.669643,77292.06,464.22123,12882.009755,12420630.0,222719.1,54702.0,84929.2,2
2,DIS,2009-05-25,12 Rounds,20th Century…,4832,0.0,0.98,29.0,167.0,12187944,...,1.459207,7050.890443,17784740.0,243.687646,613266.855076,32102560.0,25215580.0,12234694.0,17852960.0,8
3,WBD,2018-03-29,12 Strong,Warner Bros.,4502,0.08,-0.45,95.0,47.0,45500164,...,1.246515,5611.812027,56716650.0,58.586221,597017.390094,43628040.0,84078480.0,45819713.0,57114970.0,10
4,SONY,2004-06-03,13 Going On 30,Sony Pictures,115000,0.01,-0.59,1164.0,99.0,54901000,...,1.656961,190550.55585,90968840.0,164.039174,78151.920415,61307570.0,159896800.0,57231747.0,94830790.0,6


In [49]:
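# Weekly average of the inflation-adjusted total gross; 0 weeks becomes NaN to avoid dividing by zero
final_merged['weekly_gross_adjusted'] = final_merged['total_gross_adjusted_2024'] / final_merged['weeks_in_release'].replace(0, float('nan'))
# Zero theater counts become NaN so the ratio is missing rather than inf
final_merged['weekly_gross_adjusted_per_theater'] = final_merged['weekly_gross_adjusted'] / final_merged['average_theaters'].replace(0, float('nan'))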
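
Dividing by a zero week count or a zero theater count produces `inf` in pandas, as the `$inf` range reported for `gross_per_theater` in section 7 shows. A minimal sketch on toy data of masking zeros to `NaN` before dividing, so the result is missing rather than infinite; the values are made up for illustration:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "total_gross_adjusted_2024": [1200.0, 500.0, 300.0],
    "weeks_in_release": [12, 0, 3],
    "average_theaters": [10.0, 0.0, np.nan],
})

# Treat zero weeks and zero theaters as unknown instead of dividing by zero
weeks = toy["weeks_in_release"].replace(0, np.nan)
theaters = toy["average_theaters"].replace(0, np.nan)

toy["weekly_gross_adjusted"] = toy["total_gross_adjusted_2024"] / weeks
toy["weekly_gross_adjusted_per_theater"] = toy["weekly_gross_adjusted"] / theaters
print(toy)
```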

## 9. Preview Final Dataset


In [50]:
# Display first few rows
final_merged.head()


Unnamed: 0,ticker,date,title,distributor,gross,percent_yd,percent_lw,theaters,per_theater,total_gross,...,total_gross_adjusted_2024,per_theater_adjusted_2024,gross_per_theater_adjusted_2024,budget_adjusted_2024,revenue_adjusted_2024,omdb_boxoffice_clean,omdb_boxoffice_adjusted_2024,weeks_in_release,weekly_gross_adjusted,weekly_gross_adjusted_per_theater
0,PARA,2016-06-02,10 Cloverfield Lane,Paramount Pi…,11414,0.32,-0.12,120.0,95.0,72082999,...,94008240.0,123.895833,783402.037743,19562500.0,143741300.0,72082998.0,94008240.0,12,7834020.0,5684.677321
1,Private,2006-09-04,10th & Wolf,ThinkFilm,1791,0.0,0.0,6.0,299.0,49783,...,77292.06,464.22123,12882.009755,12420630.0,222719.1,54702.0,84929.2,2,38646.03,6441.004878
2,DIS,2009-05-25,12 Rounds,20th Century…,4832,0.0,0.98,29.0,167.0,12187944,...,17784740.0,243.687646,613266.855076,32102560.0,25215580.0,12234694.0,17852960.0,8,2223092.0,2736.911465
3,WBD,2018-03-29,12 Strong,Warner Bros.,4502,0.08,-0.45,95.0,47.0,45500164,...,56716650.0,58.586221,597017.390094,43628040.0,84078480.0,45819713.0,57114970.0,10,5671665.0,4525.044438
4,SONY,2004-06-03,13 Going On 30,Sony Pictures,115000,0.01,-0.59,1164.0,99.0,54901000,...,90968840.0,164.039174,78151.920415,61307570.0,159896800.0,57231747.0,94830790.0,6,15161470.0,5773.337874


## 10. Save Final Merged Dataset


In [51]:
# Preview first few rows before saving
print("First few rows of final merged dataset:")
print("=" * 80)
display(final_merged.head(10))
print("\n" + "=" * 80)
print(f"\nDataset shape: {final_merged.shape}")
print(f"Total columns: {len(final_merged.columns)}")

# Check for inflation-adjusted columns
adjusted_cols = [col for col in final_merged.columns if 'adjusted_2024' in col]
if adjusted_cols:
    print(f"✓ Found {len(adjusted_cols)} inflation-adjusted columns:")
    for col in adjusted_cols[:5]:  # Show first 5
        print(f"  - {col}")
    if len(adjusted_cols) > 5:
        print(f"  ... and {len(adjusted_cols) - 5} more")
else:
    print("⚠ WARNING: No inflation-adjusted columns found!")
    print("  → Please run the inflation adjustment cell (Section 7) before saving")
    print("  → This will add columns like 'gross_adjusted_2024', 'total_gross_adjusted_2024', etc.")

print(f"\nAll columns: {list(final_merged.columns)}")
if 'release_date' in final_merged.columns:
    print(f"Date range: {final_merged['release_date'].min()} to {final_merged['release_date'].max()}")
print(f"Unique movies: {final_merged['title'].nunique()}")

# Save the final merged dataset with CSV quoting that survives newlines and special characters.
# QUOTE_MINIMAL quotes only fields containing a comma, a quote character, or a line break;
# doublequote=True escapes embedded quotes by doubling them.
# This overwrites any existing final_merged_dataset.csv
output_path = f"{CLEAN_DATA_PATH}/final_merged_dataset.csv"
final_merged.to_csv(output_path, 
                    index=False, 
                    quoting=csv.QUOTE_MINIMAL,
                    doublequote=True,
                    lineterminator='\n')
print(f"\n✓ Saved final merged dataset to: {output_path}")
print(f"Shape: {final_merged.shape}")
if adjusted_cols:
    print(f"✓ Includes {len(adjusted_cols)} inflation-adjusted columns (2024 dollars)")
else:
    print("⚠ Saved WITHOUT inflation-adjusted columns - re-run inflation cell and save again")
if 'release_date' in final_merged.columns:
    print(f"Date range: {final_merged['release_date'].min()} to {final_merged['release_date'].max()}")
print(f"Unique movies: {final_merged['title'].nunique()}")
print(f"\nNote: This file REPLACES the previous final_merged_dataset.csv")


First few rows of final merged dataset:


Unnamed: 0,ticker,date,title,distributor,gross,percent_yd,percent_lw,theaters,per_theater,total_gross,...,total_gross_adjusted_2024,per_theater_adjusted_2024,gross_per_theater_adjusted_2024,budget_adjusted_2024,revenue_adjusted_2024,omdb_boxoffice_clean,omdb_boxoffice_adjusted_2024,weeks_in_release,weekly_gross_adjusted,weekly_gross_adjusted_per_theater
0,PARA,2016-06-02,10 Cloverfield Lane,Paramount Pi…,11414,0.32,-0.12,120.0,95.0,72082999,...,94008240.0,123.895833,783402.0,19562500.0,143741300.0,72082998.0,94008240.0,12,7834020.0,5684.677321
1,Private,2006-09-04,10th & Wolf,ThinkFilm,1791,0.0,0.0,6.0,299.0,49783,...,77292.06,464.22123,12882.01,12420630.0,222719.1,54702.0,84929.2,2,38646.03,6441.004878
2,DIS,2009-05-25,12 Rounds,20th Century…,4832,0.0,0.98,29.0,167.0,12187944,...,17784740.0,243.687646,613266.9,32102560.0,25215580.0,12234694.0,17852960.0,8,2223092.0,2736.911465
3,WBD,2018-03-29,12 Strong,Warner Bros.,4502,0.08,-0.45,95.0,47.0,45500164,...,56716650.0,58.586221,597017.4,43628040.0,84078480.0,45819713.0,57114970.0,10,5671665.0,4525.044438
4,SONY,2004-06-03,13 Going On 30,Sony Pictures,115000,0.01,-0.59,1164.0,99.0,54901000,...,90968840.0,164.039174,78151.92,61307570.0,159896800.0,57231747.0,94830790.0,6,15161470.0,5773.337874
5,AMZN,2007-09-03,1408,MGM,38250,0.0,0.0,218.0,175.0,71519946,...,107987200.0,264.230584,495354.1,37747230.0,200815200.0,71985628.0,108690300.0,10,10798720.0,4716.913499
6,WBD,2001-04-05,15 Minutes,New Line,89000,-0.04,-0.56,936.0,95.0,23917000,...,42270020.0,167.899492,45160.28,106041800.0,99608550.0,24403552.0,43129940.0,4,10567510.0,5343.533605
7,WBD,2006-03-30,16 Blocks,Warner Bros.,214226,0.06,-0.49,2066.0,104.0,34819264,...,54059670.0,161.468254,26166.35,85391870.0,101949700.0,36895141.0,57282630.0,4,13514920.0,5329.226222
8,WBD,2009-08-06,17 Again,Warner Bros.,3942,-0.1,-0.59,27.0,146.0,64167069,...,93633070.0,213.044289,3467891.0,29184150.0,198890000.0,64167069.0,93633070.0,16,5852067.0,5119.238478
9,CMCSA,2020-03-19,1917,Universal,225,-0.7,-1.0,766.0,0.0,159227644,...,194909100.0,0.0,254450.5,122409100.0,546023200.0,159227644.0,194909100.0,12,16242420.0,6889.899623




Dataset shape: (4245, 87)
Total columns: 87
✓ Found 7 inflation-adjusted columns:
  - gross_adjusted_2024
  - total_gross_adjusted_2024
  - per_theater_adjusted_2024
  - gross_per_theater_adjusted_2024
  - budget_adjusted_2024
  ... and 2 more

All columns: ['ticker', 'date', 'title', 'distributor', 'gross', 'percent_yd', 'percent_lw', 'theaters', 'per_theater', 'total_gross', 'days_in_release', 'parent company', 'release_date', 'year', 'title_key', 'weekday', 'release_month', 'release_weekday', 'is_weekend', 'average_theaters', 'average_gross', 'average_gross_per_theaters', 'avg_gross_monday', 'avg_gross_tuesday', 'avg_gross_wednesday', 'avg_gross_thursday', 'avg_gross_friday', 'avg_gross_saturday', 'avg_gross_sunday', 'avg_gross_weekend', 'avg_gross_weekday', 'tmdb_id', 'popularity', 'imdb_id', 'original_language', 'status', 'budget', 'revenue', 'adult', 'overview', 'runtime', 'vote_average', 'vote_count', 'origin_country', 'spoken_languages', 'genre_ids', 'genre_names', 'productio
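
The quoting settings used for the save can be checked with a small round trip. The sketch below is illustrative only: the `sample` frame and its values are hypothetical, and it assumes pandas 1.5 or later, where `to_csv` accepts `lineterminator`. It writes text fields containing a comma, a double quote, and a newline, reads them back, and compares the values.

```python
import csv
import io

import pandas as pd

# Hypothetical sample: text fields with a comma, a double quote, and a newline.
sample = pd.DataFrame({
    "title": ["12 Strong", 'The "Best" Movie', "1408"],
    "overview": ["War, drama", "Line one\nLine two", "Plain text"],
    "gross": [4502, 1791, 38250],
})

# Write to an in-memory buffer with the same options as the save cell.
buffer = io.StringIO()
sample.to_csv(buffer, index=False, quoting=csv.QUOTE_MINIMAL,
              doublequote=True, lineterminator="\n")

# Read the CSV text back and compare cell by cell.
buffer.seek(0)
restored = pd.read_csv(buffer)
roundtrip_ok = restored.values.tolist() == sample.values.tolist()
print(f"Round trip preserved all values: {roundtrip_ok}")
```

Running the same comparison against the real file (reading `output_path` back with `pd.read_csv`) would need extra care for NaN values, which never compare equal to themselves.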

## 10. Generate Excluded Movie Files


In [52]:
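# Reload the original final_df (before date filtering was applied)
# to identify the movies that the date filter excluded
from datetime import datetime, timedelta  # needed for the 30-day cutoff below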
print("=" * 80)
print("GENERATING EXCLUDED MOVIE FILES")
print("=" * 80)

# Reload final_df to get all movies before filtering
final_df_all = pd.read_csv(f'{CLEAN_DATA_PATH}/final_df.csv')
final_df_all['release_date'] = pd.to_datetime(final_df_all['release_date'], errors='coerce')

# Merge with OMDB data to get complete information
final_df_all_merged = final_df_all.merge(
    omdb_cleaned,
    on="imdb_id",
    how="left",
    suffixes=("", "_omdb")
)

# Define date boundaries (same as filtering step)
start_date = pd.Timestamp('1990-01-01')
end_date = pd.Timestamp('2025-10-31')
today = datetime.now()
cutoff_date = today - timedelta(days=30)
cutoff_timestamp = pd.Timestamp(cutoff_date)
effective_end_date = min(end_date, cutoff_timestamp)

# Identify movies before 1990
movies_before_1990 = final_df_all_merged[
    (final_df_all_merged['release_date'] < start_date) &
    (final_df_all_merged['release_date'].notna())
].copy()

# Identify movies released after the effective end date
# (2025-10-31, or 30 days before today if that is earlier)
movies_after_2025_10 = final_df_all_merged[
    (final_df_all_merged['release_date'] > effective_end_date) &
    (final_df_all_merged['release_date'].notna())
].copy()

# Save excluded movies
if len(movies_before_1990) > 0:
    movies_before_1990_path = f"{CLEAN_DATA_PATH}/movies_before_1990.csv"
    movies_before_1990.to_csv(movies_before_1990_path, index=False,
                              quoting=csv.QUOTE_MINIMAL, doublequote=True, lineterminator='\n')
    print(f"\n✓ Saved {len(movies_before_1990)} movies before 1990 to: {movies_before_1990_path}")
    print(f"  Date range: {movies_before_1990['release_date'].min()} to {movies_before_1990['release_date'].max()}")
else:
    print("\n✓ No movies before 1990 found")

if len(movies_after_2025_10) > 0:
    movies_after_2025_10_path = f"{CLEAN_DATA_PATH}/movies_after_2025_10.csv"
    movies_after_2025_10.to_csv(movies_after_2025_10_path, index=False,
                                quoting=csv.QUOTE_MINIMAL, doublequote=True, lineterminator='\n')
    print(f"\n✓ Saved {len(movies_after_2025_10)} movies after {effective_end_date.strftime('%Y-%m-%d')} to: {movies_after_2025_10_path}")
    print(f"  Date range: {movies_after_2025_10['release_date'].min()} to {movies_after_2025_10['release_date'].max()}")
else:
    print(f"\n✓ No movies after {effective_end_date.strftime('%Y-%m-%d')} found")

print(f"\nSummary:")
print(f"  Total movies in final dataset: {len(final_merged)}")
print(f"  Total movies excluded: {len(movies_before_1990) + len(movies_after_2025_10)}")
print(f"  Total movies in original dataset: {len(final_df_all_merged)}")


GENERATING EXCLUDED MOVIE FILES

✓ Saved 25 movies before 1990 to: ../data/cleaned/movies_before_1990.csv
  Date range: 1936-02-04 00:00:00 to 1989-08-08 00:00:00

✓ Saved 4 movies after 2025-10-30 to: ../data/cleaned/movies_after_2025_10.csv
  Date range: 2025-11-06 00:00:00 to 2025-11-06 00:00:00

Summary:
  Total movies in final dataset: 4245
  Total movies excluded: 29
  Total movies in original dataset: 4274
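
The summary reconciles (4245 kept + 29 excluded = 4274 original), which suggests no row has a missing `release_date`. Rows with an unparseable or missing date would fall into neither excluded file, so the counts would stop adding up. The sketch below makes that accounting explicit. It is a hypothetical helper, not the notebook's filtering code: `split_by_release_window`, its parameters, and the toy data are assumptions, and it assumes the kept window is inclusive at both ends.

```python
from datetime import datetime, timedelta

import pandas as pd


def split_by_release_window(df, start_date, end_date, now=None, lag_days=30):
    """Split df into (before, kept, after, undated) by its release_date column."""
    now = now or datetime.now()
    # Effective end is the earlier of the fixed end date and `lag_days` before now.
    effective_end = min(pd.Timestamp(end_date),
                        pd.Timestamp(now - timedelta(days=lag_days)))
    start = pd.Timestamp(start_date)
    dates = pd.to_datetime(df["release_date"], errors="coerce")

    before = df[dates < start]
    after = df[dates > effective_end]
    undated = df[dates.isna()]  # comparisons with NaT are False, so track these separately
    kept = df[(dates >= start) & (dates <= effective_end)]
    return before, kept, after, undated


# Toy data (hypothetical): one row per bucket.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "release_date": ["1989-08-08", "2016-06-02", "2025-11-06", None],
})
before, kept, after, undated = split_by_release_window(
    toy, "1990-01-01", "2025-10-31", now=datetime(2025, 11, 29)
)

# Every row lands in exactly one bucket, so the counts must sum to the total.
counts = [len(before), len(kept), len(after), len(undated)]
print(f"before={counts[0]}, kept={counts[1]}, after={counts[2]}, undated={counts[3]}")
```

Passing `now` explicitly keeps the split reproducible. With `datetime.now()` the effective end date moves each day until it reaches the fixed end date.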
