# Comprehensive Data Merge: Final Dataset with OMDB

This notebook merges the final cleaned dataset (SOVAI + TMDB) with OMDB ratings data.

**Key improvements over previous merge:**
- Uses a LEFT JOIN to preserve ALL movies from final_df.csv (not just those with OMDB ratings)
- Prefixes OMDB columns with `omdb_` so they cannot collide with existing columns
- Drops columns that are entirely null
- Collapses rows that share a `title_key` into one row, aggregating their values
- Provides detailed merge statistics


In [27]:
import pandas as pd
import glob
import os
import csv
from pathlib import Path


## 1. Load Final Dataset (SOVAI + TMDB merged)


In [28]:
CLEAN_DATA_PATH = "../data/cleaned"
OMDB_DATA_PATH = "../omdb_api"

# Load final_df (already has SOVAI + TMDB merged and filtered)
final_df = pd.read_csv(f'{CLEAN_DATA_PATH}/final_df.csv')
print(f"Loaded {len(final_df)} rows, {len(final_df.columns)} columns")
print(f"Movies with imdb_id: {final_df['imdb_id'].notna().sum()}")
print(f"Movies without imdb_id: {final_df['imdb_id'].isna().sum()}")
final_df.head()


Loaded 10067 rows, 39 columns
Movies with imdb_id: 8522
Movies without imdb_id: 1545


Unnamed: 0,ticker,date,title,distributor,gross,percent_yd,percent_lw,theaters,per_theater,total_gross,...,vote_average,vote_count,origin_country,spoken_languages,genre_ids,genre_names,production_company_ids,production_company_names,belongs_to_collection,gross_per_theater
0,PARA,2016-06-02,10 Cloverfield Lane,Paramount Pi…,11414,0.32,-0.12,120.0,95.0,72082999,...,7.0,8351.0,US,English,"53, 878, 18, 27","Thriller, Science Fiction, Drama, Horror",11461,Bad Robot,,95.116667
1,Private,2006-09-04,10th & Wolf,ThinkFilm,1791,0.0,0.0,6.0,299.0,49783,...,5.856,108.0,US,English,"28, 80, 18, 9648, 53","Action, Crime, Drama, Mystery, Thriller",41427,Suzanne DeLaurentiis Productions,,298.5
2,6758,2009-05-25,12,Sony Picture…,344,0.0,0.0,5.0,69.0,119587,...,5.6,50.0,US,English,18,Drama,,,,68.8
3,6758,2009-05-25,12,Sony Picture…,344,0.0,0.0,5.0,69.0,119587,...,5.0,57.0,US,English,35,Comedy,,,,68.8
4,DIS,2009-05-25,12 Rounds,20th Century…,4832,0.0,0.98,29.0,167.0,12187944,...,5.904,819.0,US,English,"28, 53, 80","Action, Thriller, Crime","1557, 17887, 2890, 10339","The Mark Gordon Company, Midnight Sun Pictures...",12 Rounds Collection,166.62069


## 2. Load and Combine OMDB Batch Files


In [29]:
# Find all OMDB batch files
csv_files = sorted(glob.glob(f"{OMDB_DATA_PATH}/omdbmovies_batch_*.csv"))
print(f"Found {len(csv_files)} OMDB batch files")

# Load and combine all batches
dfs = []
for file in csv_files:
    df = pd.read_csv(file)
    # Remove rows where Title is null or empty
    df = df[df["Title"].notna()]
    df = df[df["Title"].str.strip() != ""]
    dfs.append(df)
    print(f"  Loaded {os.path.basename(file)}: {len(df)} rows")

# Combine all batches
omdb_merged = pd.concat(dfs, ignore_index=True)
print(f"\nTotal OMDB records before deduplication: {len(omdb_merged)}")

# Remove duplicates based on imdbID (keep first occurrence)
initial_count = len(omdb_merged)
omdb_merged = omdb_merged.drop_duplicates(subset=["imdbID"], keep="first")
duplicates_removed = initial_count - len(omdb_merged)
if duplicates_removed > 0:
    print(f"Removed {duplicates_removed} duplicate entries")

print(f"Total OMDB records: {len(omdb_merged)}")
print(f"Unique IMDb IDs: {omdb_merged['imdbID'].nunique()}")


Found 11 OMDB batch files
  Loaded omdbmovies_batch_0.csv: 814 rows
  Loaded omdbmovies_batch_1.csv: 833 rows
  Loaded omdbmovies_batch_10.csv: 59 rows
  Loaded omdbmovies_batch_2.csv: 861 rows
  Loaded omdbmovies_batch_3.csv: 824 rows
  Loaded omdbmovies_batch_4.csv: 848 rows
  Loaded omdbmovies_batch_5.csv: 846 rows
  Loaded omdbmovies_batch_6.csv: 811 rows
  Loaded omdbmovies_batch_7.csv: 860 rows
  Loaded omdbmovies_batch_8.csv: 848 rows
  Loaded omdbmovies_batch_9.csv: 877 rows

Total OMDB records before deduplication: 8481
Total OMDB records: 8481
Unique IMDb IDs: 8481
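
The listing above shows `omdbmovies_batch_10.csv` loaded before `omdbmovies_batch_2.csv`, because `sorted` orders file names lexicographically. With `keep="first"` in `drop_duplicates`, load order decides which copy of a duplicated `imdbID` survives; no duplicates were found in this run, so the result is unaffected. If numeric order matters, a minimal sketch could sort on the batch number instead. `batch_number` is a hypothetical helper, not part of the notebook.

```python
import re

def batch_number(path):
    """Return the integer batch index embedded in a name like omdbmovies_batch_10.csv."""
    match = re.search(r"_batch_(\d+)\.csv$", path)
    # Assumption for this sketch: names that do not match the pattern sort first
    return int(match.group(1)) if match else -1

files = [
    "omdbmovies_batch_0.csv",
    "omdbmovies_batch_10.csv",
    "omdbmovies_batch_2.csv",
]
# Sort by the numeric batch index rather than by the raw string
ordered = sorted(files, key=batch_number)
print(ordered)
```

In the notebook this would amount to passing `key=batch_number` to the existing `sorted(glob.glob(...))` call.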


## 3. Clean OMDB Data


In [30]:
# Rename imdbID to imdb_id for consistency
omdb_merged = omdb_merged.rename(columns={"imdbID": "imdb_id"})

# Keep movies released on or after 1990-01-01 (matching final_df filter).
# Rows whose Released value cannot be parsed become NaT and are dropped as well.
omdb_merged["omdb_release_date"] = pd.to_datetime(omdb_merged["Released"], errors="coerce")
omdb_merged = omdb_merged[omdb_merged["omdb_release_date"] >= pd.Timestamp("1990-01-01")]
print(f"After filtering to post-1990 releases: {len(omdb_merged)} rows")

# Select relevant columns (exclude redundant ones like Type, Season, Episode, etc.)
columns_to_keep = [
    "imdb_id",
    "Title",
    "Year",
    "Rated",
    "Released",
    "Runtime",
    "Genre",
    "Director",
    "Writer",
    "Actors",
    "Plot",
    "Language",
    "Country",
    "Awards",
    "Poster",
    "Metascore",
    "imdbRating",
    "imdbVotes",
    "BoxOffice",
    "Production",
    "Rating_InternetMovieDatabase",
    "Rating_RottenTomatoes",
    "Rating_Metacritic",
]

# Only keep columns that exist in the dataframe
available_columns = [col for col in columns_to_keep if col in omdb_merged.columns]
omdb_cleaned = omdb_merged[available_columns].copy()

# Add prefix to OMDB columns to avoid conflicts (except imdb_id which is the merge key)
omdb_columns = {col: f"omdb_{col.lower()}" if col != "imdb_id" else col 
                for col in omdb_cleaned.columns}
omdb_cleaned = omdb_cleaned.rename(columns=omdb_columns)

print(f"Final OMDB data: {len(omdb_cleaned)} rows, {len(omdb_cleaned.columns)} columns")
omdb_cleaned.head()


After filtering to post-1990 releases: 6871 rows
Final OMDB data: 6871 rows, 23 columns


Unnamed: 0,imdb_id,omdb_title,omdb_year,omdb_rated,omdb_released,omdb_runtime,omdb_genre,omdb_director,omdb_writer,omdb_actors,...,omdb_awards,omdb_poster,omdb_metascore,omdb_imdbrating,omdb_imdbvotes,omdb_boxoffice,omdb_production,omdb_rating_internetmoviedatabase,omdb_rating_rottentomatoes,omdb_rating_metacritic
1,tt9362736,Die My Love,2025,R,07 Nov 2025,119 min,"Drama, Thriller",Lynne Ramsay,"Enda Walsh, Lynne Ramsay, Alice Birch","Jennifer Lawrence, Robert Pattinson, Sissy Spacek",...,10 nominations total,https://m.media-amazon.com/images/M/MV5BYjc5OW...,72.0,6.6,9529.0,"$4,884,888",,6.6/10,,72/100
2,tt29567915,Nuremberg,2025,PG-13,07 Nov 2025,148 min,"Drama, History, Thriller",James Vanderbilt,"Jack El-Hai, James Vanderbilt","Rami Malek, Russell Crowe, Richard E. Grant",...,1 win & 4 nominations total,https://m.media-amazon.com/images/M/MV5BMjZhNG...,,,,,,,67%,
3,tt31227572,Predator: Badlands,2025,PG-13,07 Nov 2025,107 min,"Action, Adventure, Sci-Fi",Dan Trachtenberg,"Patrick Aison, Jim Thomas, John Thomas","Elle Fanning, Dimitrius Schuster-Koloamatangi",...,,https://m.media-amazon.com/images/M/MV5BNTdjZG...,71.0,7.6,18100.0,"$40,000,000",,7.6/10,85%,71/100
4,tt12583926,Anniversary,2025,R,29 Oct 2025,,Thriller,Jan Komasa,"Lori Rosene-Gambino, Jan Komasa","Diane Lane, Kyle Chandler, Zoey Deutch",...,,,,,,,,,62%,
5,tt14661372,Anniversary,2021,,26 Aug 2021,7 min,"Short, Horror",Craig Ouellette,Craig Ouellette,"David Crane, David T. Crane, Katie Peabody",...,1 win,https://m.media-amazon.com/images/M/MV5BZjQ2Yj...,,,,,,,,
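
Several OMDB columns in the preview above are stored as strings (`119 min`, `$4,884,888`, `85%`, `72/100`), so they cannot be averaged or compared numerically as they stand. The following is a rough sketch of one way to extract the leading number from such values. `leading_number` and the toy frame are hypothetical and are not part of the notebook.

```python
import pandas as pd

# Toy rows in the formats visible in the preview above
sample = pd.DataFrame({
    "omdb_runtime": ["119 min", None, "7 min"],
    "omdb_boxoffice": ["$4,884,888", "$40,000,000", None],
    "omdb_rating_rottentomatoes": [None, "85%", "62%"],
})

def leading_number(series):
    """Strip '$' and ',' characters, then parse the first number in each value."""
    cleaned = series.str.replace(r"[$,]", "", regex=True)
    # expand=True (the default) returns a DataFrame; column 0 holds the capture group
    first_number = cleaned.str.extract(r"(\d+(?:\.\d+)?)")[0]
    return pd.to_numeric(first_number, errors="coerce")

parsed = sample.apply(leading_number)
print(parsed)
```

Missing values stay missing after parsing, so the missing-value counts reported later in the notebook would not change.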


## 4. Merge Final Dataset with OMDB Data

**Important:** We use LEFT JOIN to preserve ALL movies from final_df, even if they don't have OMDB data.
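
As a small illustration of the left-join behaviour, the sketch below merges two toy frames that stand in for `final_df` and `omdb_cleaned`. The `validate` and `indicator` arguments are optional additions for checking the result and are not used in the notebook's own merge.

```python
import pandas as pd

# Toy stand-ins for final_df and omdb_cleaned (values are illustrative only)
left = pd.DataFrame({
    "title": ["A", "B", "C"],
    "imdb_id": ["tt001", "tt002", None],
})
right = pd.DataFrame({
    "imdb_id": ["tt001", "tt999"],
    "omdb_title": ["A (OMDB)", "Z (OMDB)"],
})

merged = left.merge(
    right,
    on="imdb_id",
    how="left",              # keep every row of `left`
    validate="many_to_one",  # raise if imdb_id is duplicated in `right`
    indicator=True,          # add a _merge column: 'both' or 'left_only'
)

# The row count is unchanged, and unmatched rows carry NaN in the OMDB columns
print(len(merged) == len(left))
print(merged["_merge"].value_counts())
```

Because the OMDB side is deduplicated on `imdbID` earlier in the notebook, a many-to-one check of this kind should pass there as well.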


In [31]:
# Left join to preserve all movies from final_df
final_merged = final_df.merge(
    omdb_cleaned,
    on="imdb_id",
    how="left",  # Keep all movies from final_df
    suffixes=("", "_omdb")
)

print(f"Merge complete: {len(final_merged)} rows, {len(final_merged.columns)} columns")
if 'omdb_title' in final_merged.columns:
    print(f"Movies with OMDB data: {final_merged['omdb_title'].notna().sum()}")
    print(f"Movies without OMDB data: {final_merged['omdb_title'].isna().sum()}")
    print(f"Percentage with OMDB data: {(final_merged['omdb_title'].notna().sum() / len(final_merged) * 100):.1f}%")
else:
    print("Warning: OMDB data columns not found in merged dataset")


Merge complete: 10067 rows, 61 columns
Movies with OMDB data: 6869
Movies without OMDB data: 3198
Percentage with OMDB data: 68.2%


In [32]:
# Filter to movies from 1990 to October 2025 (excluding last 30 days)
from datetime import datetime, timedelta

# Convert release_date to datetime if not already
final_merged['release_date'] = pd.to_datetime(final_merged['release_date'], errors='coerce')

# Set date range: 1990-01-01 to 2025-10-31
start_date = pd.Timestamp('1990-01-01')
end_date = pd.Timestamp('2025-10-31')

# Also exclude movies released within the 30 days before the notebook is run
today = datetime.now()
cutoff_date = today - timedelta(days=30)
cutoff_timestamp = pd.Timestamp(cutoff_date)

# Use the earlier of end_date or cutoff_date
effective_end_date = min(end_date, cutoff_timestamp)

print("\n" + "=" * 80)
print("FILTERING BY DATE RANGE (1990 to October 2025)")
print("=" * 80)
print(f"Initial rows after merge: {len(final_merged)}")
print(f"Filtering to: {start_date.strftime('%Y-%m-%d')} to {effective_end_date.strftime('%Y-%m-%d')}")

# Filter by release date
before_filter = len(final_merged)
final_merged = final_merged[
    (final_merged['release_date'] >= start_date) & 
    (final_merged['release_date'] <= effective_end_date)
].copy()

filtered_out = before_filter - len(final_merged)
print(f"Rows after date filtering: {len(final_merged)}")
print(f"Rows filtered out: {filtered_out} ({(filtered_out/before_filter*100):.1f}%)")
if len(final_merged) > 0:
    print(f"Date range in dataset: {final_merged['release_date'].min()} to {final_merged['release_date'].max()}")



FILTERING BY DATE RANGE (1990 to October 2025)
Initial rows after merge: 10067
Filtering to: 1990-01-01 to 2025-10-26
Rows after date filtering: 10004
Rows filtered out: 63 (0.6%)
Date range in dataset: 1993-02-11 00:00:00 to 2025-10-23 00:00:00
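
The effective end date depends on when the notebook is run, because it is the earlier of 2025-10-31 and the run date minus 30 days. The value `2025-10-26` printed above is consistent with a run on 2025-11-25. The sketch below isolates that rule in a function with an explicit `today` argument so the cutoff can be reproduced. `effective_end_date` is a hypothetical helper, and it normalizes to midnight, which the notebook's `datetime.now()` version does not.

```python
import pandas as pd

def effective_end_date(today, end_date=pd.Timestamp("2025-10-31"), lag_days=30):
    """Return the earlier of the fixed end date and `today` minus `lag_days`."""
    # normalize() drops the time of day (an assumption made for reproducibility)
    cutoff = pd.Timestamp(today).normalize() - pd.Timedelta(days=lag_days)
    return min(end_date, cutoff)

print(effective_end_date("2025-11-25"))  # 2025-10-26 00:00:00
print(effective_end_date("2026-01-15"))  # 2025-10-31 00:00:00
```

Any run on or after 2025-11-30 yields the fixed end date of 2025-10-31, so later reruns would admit a few more rows than the run shown here.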


## 5. Clean Up Duplicate/Redundant Columns


In [33]:
initial_cols = len(final_merged.columns)
columns_to_drop = []

# Check for duplicate date columns left over from a default-suffix merge.
# With suffixes=("", "_omdb") in the merge above, date_x/date_y are not expected.
if "date_x" in final_merged.columns and "date_y" in final_merged.columns:
    # Keep date_x (from final_df), rename it to date, and drop date_y
    columns_to_drop.append("date_y")
    final_merged = final_merged.rename(columns={"date_x": "date"})

# Drop columns with all null values
null_cols = final_merged.columns[final_merged.isnull().all()].tolist()
columns_to_drop.extend(null_cols)

if columns_to_drop:
    final_merged = final_merged.drop(columns=columns_to_drop)
    print(f"Dropped {len(columns_to_drop)} redundant/empty columns")

print(f"Final columns: {len(final_merged.columns)} (reduced from {initial_cols})")

# Remove duplicate movies - intelligently aggregate values when merging duplicates
print("\n" + "=" * 80)
print("REMOVING DUPLICATE MOVIES WITH VALUE AGGREGATION")
print("=" * 80)
initial_rows = len(final_merged)
print(f"Initial rows: {initial_rows}")

# Build the deduplication key. When title_key exists it is used on its own;
# tmdb_id and imdb_id serve only as a fallback when title_key is missing.
# Caveat: title_key is a normalized title, so rows for different films that
# share a title are grouped together and merged into a single row.
if 'title_key' in final_merged.columns:
    final_merged['_movie_id'] = final_merged['title_key'].copy()
else:
    # Fallback to IDs if title_key doesn't exist
    if 'imdb_id' in final_merged.columns and 'tmdb_id' in final_merged.columns:
        final_merged['_movie_id'] = final_merged['tmdb_id'].fillna(final_merged['imdb_id'])
    elif 'tmdb_id' in final_merged.columns:
        final_merged['_movie_id'] = final_merged['tmdb_id']
    elif 'imdb_id' in final_merged.columns:
        final_merged['_movie_id'] = final_merged['imdb_id']
    else:
        # Last resort: use title
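        # The fallback Series must share final_merged's index, which is no longer 0..n-1 after filtering
        final_merged['_movie_id'] = final_merged.get(
            'title', pd.Series(range(len(final_merged)), index=final_merged.index))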

# Check for duplicates before aggregation
print("\nChecking for duplicates...")
print(f"Total rows: {len(final_merged)}")
print(f"Unique _movie_id values: {final_merged['_movie_id'].nunique()}")
print(f"Null _movie_id values: {final_merged['_movie_id'].isna().sum()}")

# Check duplicates by _movie_id
duplicate_counts = final_merged['_movie_id'].value_counts()
duplicates = duplicate_counts[duplicate_counts > 1]
print(f"\nMovies with duplicates (by _movie_id): {len(duplicates)}")
print(f"Total duplicate rows: {duplicates.sum() - len(duplicates) if len(duplicates) > 0 else 0}")

# Also check for duplicates by title (in case IDs are missing)
title_dup = pd.Series(dtype=int)
if 'title' in final_merged.columns:
    title_duplicates = final_merged['title'].value_counts()
    title_dup = title_duplicates[title_duplicates > 1]
    print(f"Movies with duplicate titles: {len(title_dup)}")
    if len(title_dup) > 0:
        print(f"Total rows with duplicate titles: {title_dup.sum() - len(title_dup)}")

if len(duplicates) > 0:
    print(f"\nTop 10 movies with most duplicates:")
    for movie_id, count in duplicates.head(10).items():
        movie_rows = final_merged[final_merged['_movie_id'] == movie_id]
        title = movie_rows['title'].iloc[0] if 'title' in movie_rows.columns else 'N/A'
        print(f"  {movie_id}: {count} rows - {title}")
elif len(title_dup) > 0:
    print(f"\nNote: Found {len(title_dup)} movies with duplicate titles but unique _movie_id")
    print("This might indicate the same movie with different IDs. Consider using title_key for deduplication.")

# Define aggregation strategies for different column types
def aggregate_duplicates(group):
    """Aggregate duplicate rows intelligently - returns a Series."""
    if len(group) == 1:
        return group.iloc[0]
    
    result = group.iloc[0].copy()
    
    # For each column, decide how to aggregate
    for col in group.columns:
        if col == '_movie_id':
            continue  # Skip the grouping column
        
        values = group[col].dropna()
        
        if len(values) == 0:
            # All NaN, keep NaN
            result[col] = None
        elif len(values) == 1:
            # Only one non-null value, use it
            result[col] = values.iloc[0]
        else:
            # Multiple non-null values - need to decide
            if col in ['gross', 'total_gross', 'revenue', 'budget', 'theaters', 
                      'vote_count', 'omdb_imdbvotes']:
                # Numeric columns: take maximum (most complete/highest value)
                numeric_vals = pd.to_numeric(values, errors='coerce')
                result[col] = numeric_vals.max()
            elif col in ['omdb_metascore']:
                # Metascore: take maximum (best rating)
                numeric_vals = pd.to_numeric(values, errors='coerce')
                result[col] = numeric_vals.max()
            elif col in ['per_theater', 'gross_per_theater', 'vote_average', 
                        'popularity', 'omdb_imdbrating']:
                # Average/rate columns: take maximum (best performance)
                numeric_vals = pd.to_numeric(values, errors='coerce')
                result[col] = numeric_vals.max()
            elif col in ['date', 'release_date', 'omdb_released']:
                # Date columns: take most recent
                dates = pd.to_datetime(values, errors='coerce')
                result[col] = dates.max()
            elif col in ['days_in_release']:
                # Days in release: take maximum (most complete run)
                numeric_vals = pd.to_numeric(values, errors='coerce')
                result[col] = numeric_vals.max()
            elif col in ['title', 'distributor', 'overview', 'omdb_title', 
                        'omdb_plot', 'omdb_director', 'omdb_writer', 'omdb_actors',
                        'genre_names', 'production_company_names']:
                # Text columns: take longest/non-empty (most complete)
                text_values = values.astype(str)
                longest_idx = text_values.str.len().idxmax()
                result[col] = text_values.loc[longest_idx]
            elif col in ['omdb_rating_rottentomatoes', 'omdb_rating_metacritic', 
                        'omdb_rating_internetmoviedatabase']:
                # Rating strings: take first non-null (they should be same)
                result[col] = values.iloc[0]
            else:
                # Default: take first non-null value
                result[col] = values.iloc[0]
    
    return result

# Group by movie_id and aggregate
print("\nAggregating duplicate movies...")
if len(duplicates) > 0:
    final_merged = final_merged.groupby('_movie_id', group_keys=False).apply(aggregate_duplicates).reset_index(drop=True)
else:
    print("No duplicates found - skipping aggregation")

# Remove temporary column
final_merged = final_merged.drop(columns=['_movie_id'], errors='ignore')

duplicates_removed = initial_rows - len(final_merged)
print(f"\nDuplicates removed: {duplicates_removed} rows")
print(f"Final unique movies: {len(final_merged)} rows")
print(f"Reduction: {(duplicates_removed/initial_rows*100):.1f}%")
print(f"\nAggregation strategy:")
print("  - Numeric (gross, revenue, theaters, etc.): Maximum value")
print("  - Dates: Most recent")
print("  - Text (title, overview, etc.): Longest/most complete")
print("  - Ratings: First non-null value")


Final columns: 61 (reduced from 61)

REMOVING DUPLICATE MOVIES WITH VALUE AGGREGATION
Initial rows: 10004

Checking for duplicates...
Total rows: 10004
Unique _movie_id values: 4429
Null _movie_id values: 0

Movies with duplicates (by _movie_id): 1616
Total duplicate rows: 5575
Movies with duplicate titles: 1616
Total rows with duplicate titles: 5575

Top 10 movies with most duplicates:
  home: 51 rows - Home
  the gift: 29 rows - The Gift
  the end: 29 rows - The End
  trapped: 27 rows - Trapped
  macbeth: 26 rows - Macbeth
  limbo: 26 rows - Limbo
  the box: 25 rows - The Box
  hamlet: 24 rows - Hamlet
  brothers: 24 rows - Brothers
  sanctuary: 23 rows - Sanctuary

Aggregating duplicate movies...

Duplicates removed: 5575 rows
Final unique movies: 4429 rows
Reduction: 55.7%

Aggregation strategy:
  - Numeric (gross, revenue, theaters, etc.): Maximum value
  - Dates: Most recent
  - Text (title, overview, etc.): Longest/most complete
  - Ratings: First non-null value
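
The most duplicated keys above ("home", "hamlet", "macbeth") are common titles, and the first preview in this notebook shows two rows titled "12" with different genres, so grouping on `title_key` alone may merge rows that describe different films. As a point of comparison only, the sketch below builds a key that prefers `tmdb_id`, then `imdb_id`, then `title_key`. This is an alternative to, not a description of, the notebook's method; `movie_key` and the IDs in the toy frame are made up.

```python
import pandas as pd

# Toy rows: two different films titled "Home", one of them repeated,
# plus a row with no IDs at all (all IDs here are invented)
rows = pd.DataFrame({
    "title_key": ["home", "home", "home", "limbo"],
    "tmdb_id": [101.0, 101.0, 202.0, None],
    "imdb_id": ["tt0000001", "tt0000001", "tt0000002", None],
})

def movie_key(df):
    """Prefer tmdb_id, then imdb_id, then title_key (hypothetical priority order)."""
    tmdb = df["tmdb_id"].map(lambda v: f"tmdb:{int(v)}" if pd.notna(v) else None)
    imdb = df["imdb_id"].map(lambda v: f"imdb:{v}" if pd.notna(v) else None)
    title = "title:" + df["title_key"].astype(str)
    return tmdb.fillna(imdb).fillna(title)

rows["_movie_id"] = movie_key(rows)
# title_key sees 2 groups, the ID-first key sees 3
print(rows["title_key"].nunique(), rows["_movie_id"].nunique())
```

Whether an ID-first key is preferable depends on how the upstream rows were matched to TMDB: if one box-office row was joined to several TMDB entries by title, an ID-first key would keep those mismatches as separate rows instead of merging them.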


## 6. Dataset Summary and Missing Value Analysis


In [34]:
print("Final Dataset Summary:")
print(f"Total rows: {len(final_merged)}")
print(f"Total columns: {len(final_merged.columns)}")
print("\nMissing values per column (top 15):")
missing_counts = final_merged.isnull().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
for col, count in missing_counts.head(15).items():
    pct = (count / len(final_merged)) * 100
    print(f"  {col}: {count} ({pct:.1f}%)")


Final Dataset Summary:
Total rows: 4429
Total columns: 61

Missing values per column (top 15):
  omdb_production: 4393 (99.2%)
  belongs_to_collection: 3621 (81.8%)
  omdb_awards: 613 (13.8%)
  omdb_metascore: 456 (10.3%)
  omdb_rating_metacritic: 455 (10.3%)
  omdb_rating_rottentomatoes: 351 (7.9%)
  omdb_imdbvotes: 238 (5.4%)
  omdb_rated: 233 (5.3%)
  omdb_boxoffice: 216 (4.9%)
  production_company_ids: 160 (3.6%)
  production_company_names: 160 (3.6%)
  omdb_writer: 154 (3.5%)
  omdb_imdbrating: 99 (2.2%)
  omdb_rating_internetmoviedatabase: 99 (2.2%)
  omdb_language: 83 (1.9%)
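
The cleanup step earlier only drops columns that are entirely null, which is why `omdb_production` survives at 99.2% missing. If nearly empty columns should also go, one option is a threshold on the share of missing values, sketched below. `drop_sparse_columns` and its `max_missing` default are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

def drop_sparse_columns(df, max_missing=0.95):
    """Drop columns whose share of missing values exceeds `max_missing`."""
    missing_share = df.isna().mean()  # fraction of NaN per column
    sparse = missing_share[missing_share > max_missing].index.tolist()
    return df.drop(columns=sparse), sparse

# Toy frame: one fully empty column and one that is 75% empty
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "omdb_production": [None, None, None, None],
    "belongs_to_collection": [None, None, None, "X Collection"],
})
trimmed, dropped = drop_sparse_columns(toy, max_missing=0.9)
print(dropped)
```

With the missing-value shares reported above, a 0.95 threshold would remove `omdb_production` and keep `belongs_to_collection` (81.8% missing).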


## 7. Preview Final Dataset


In [35]:
# Display first few rows
final_merged.head()


Unnamed: 0,ticker,date,title,distributor,gross,percent_yd,percent_lw,theaters,per_theater,total_gross,...,omdb_awards,omdb_poster,omdb_metascore,omdb_imdbrating,omdb_imdbvotes,omdb_boxoffice,omdb_production,omdb_rating_internetmoviedatabase,omdb_rating_rottentomatoes,omdb_rating_metacritic
0,PARA,2016-06-02,10 Cloverfield Lane,Paramount Pi…,11414,0.32,-0.12,120.0,95.0,72082999,...,16 wins & 48 nominations total,https://m.media-amazon.com/images/M/MV5BMjEzMj...,76.0,7.2,377108.0,"$72,082,998",,7.2/10,91%,76/100
1,Private,2006-09-04,10th & Wolf,ThinkFilm,1791,0.0,0.0,6.0,299.0,49783,...,1 win,https://m.media-amazon.com/images/M/MV5BMjE1ND...,36.0,6.3,7033.0,"$54,702",,6.3/10,19%,36/100
2,6758,2009-05-25,12,Sony Picture…,344,0.0,0.0,5.0,69.0,119587,...,,https://m.media-amazon.com/images/M/MV5BN2I5Yj...,,,,,,,,
3,DIS,2009-05-25,12 Rounds,20th Century…,4832,0.0,0.98,29.0,167.0,12187944,...,,https://m.media-amazon.com/images/M/MV5BZDI5NG...,38.0,5.6,30927.0,"$12,234,694",,5.6/10,31%,38/100
4,WBD,2018-03-29,12 Strong,Warner Bros.,4502,0.08,-0.45,95.0,47.0,45500164,...,3 nominations total,https://m.media-amazon.com/images/M/MV5BNTEzMj...,54.0,6.5,97951.0,"$45,819,713",,6.5/10,50%,54/100


## 8. Save Final Merged Dataset


In [38]:
# Preview first few rows before saving
print("First few rows of final merged dataset:")
print("=" * 80)
display(final_merged.head(10))
print("\n" + "=" * 80)
print(f"\nDataset shape: {final_merged.shape}")
print(f"Columns: {list(final_merged.columns)}")
if 'release_date' in final_merged.columns:
    print(f"Date range: {final_merged['release_date'].min()} to {final_merged['release_date'].max()}")
print(f"Unique movies: {final_merged['title'].nunique()}")

# Save final merged dataset with proper CSV quoting to handle newlines and special characters
# Using QUOTE_MINIMAL with doublequote=True ensures proper handling of newlines, quotes, and commas
# This will REPLACE the existing final_merged_dataset.csv
output_path = f"{CLEAN_DATA_PATH}/final_merged_dataset.csv"
final_merged.to_csv(output_path, 
                    index=False, 
                    quoting=csv.QUOTE_MINIMAL,
                    doublequote=True,
                    lineterminator='\n')
print(f"\n✓ Saved final merged dataset to: {output_path}")
print(f"Shape: {final_merged.shape}")
if 'release_date' in final_merged.columns:
    print(f"Date range: {final_merged['release_date'].min()} to {final_merged['release_date'].max()}")
print(f"Unique movies: {final_merged['title'].nunique()}")
print(f"\nNote: This file REPLACES the previous final_merged_dataset.csv")


First few rows of final merged dataset:


Unnamed: 0,ticker,date,title,distributor,gross,percent_yd,percent_lw,theaters,per_theater,total_gross,...,omdb_awards,omdb_poster,omdb_metascore,omdb_imdbrating,omdb_imdbvotes,omdb_boxoffice,omdb_production,omdb_rating_internetmoviedatabase,omdb_rating_rottentomatoes,omdb_rating_metacritic
0,PARA,2016-06-02,10 Cloverfield Lane,Paramount Pi…,11414,0.32,-0.12,120.0,95.0,72082999,...,16 wins & 48 nominations total,https://m.media-amazon.com/images/M/MV5BMjEzMj...,76.0,7.2,377108.0,"$72,082,998",,7.2/10,91%,76/100
1,Private,2006-09-04,10th & Wolf,ThinkFilm,1791,0.0,0.0,6.0,299.0,49783,...,1 win,https://m.media-amazon.com/images/M/MV5BMjE1ND...,36.0,6.3,7033.0,"$54,702",,6.3/10,19%,36/100
2,6758,2009-05-25,12,Sony Picture…,344,0.0,0.0,5.0,69.0,119587,...,,https://m.media-amazon.com/images/M/MV5BN2I5Yj...,,,,,,,,
3,DIS,2009-05-25,12 Rounds,20th Century…,4832,0.0,0.98,29.0,167.0,12187944,...,,https://m.media-amazon.com/images/M/MV5BZDI5NG...,38.0,5.6,30927.0,"$12,234,694",,5.6/10,31%,38/100
4,WBD,2018-03-29,12 Strong,Warner Bros.,4502,0.08,-0.45,95.0,47.0,45500164,...,3 nominations total,https://m.media-amazon.com/images/M/MV5BNTEzMj...,54.0,6.5,97951.0,"$45,819,713",,6.5/10,50%,54/100
5,SONY,2004-06-03,13 Going On 30,Sony Pictures,115000,0.01,-0.59,1164.0,99.0,54901000,...,11 nominations total,https://m.media-amazon.com/images/M/MV5BMjE1Nz...,57.0,6.3,239662.0,"$57,231,747",,6.3/10,65%,57/100
6,AMZN,2007-09-03,1408,MGM,38250,0.0,0.0,218.0,175.0,71519946,...,4 wins & 12 nominations total,https://m.media-amazon.com/images/M/MV5BMjQ2OD...,64.0,6.8,309249.0,"$71,985,628",,6.8/10,79%,64/100
7,WBD,2001-04-05,15 Minutes,New Line,89000,-0.04,-0.56,936.0,95.0,23917000,...,1 nomination total,https://m.media-amazon.com/images/M/MV5BOTg5MD...,34.0,6.1,53088.0,"$24,403,552",,6.1/10,32%,34/100
8,WBD,2006-03-30,16 Blocks,Warner Bros.,214226,0.06,-0.49,2066.0,104.0,34819264,...,2 nominations total,https://m.media-amazon.com/images/M/MV5BMTQ1ND...,63.0,6.6,137629.0,"$36,895,141",,6.6/10,55%,63/100
9,WBD,2009-08-06,17 Again,Warner Bros.,3942,-0.1,-0.59,27.0,146.0,64167069,...,3 wins & 5 nominations total,https://m.media-amazon.com/images/M/MV5BMjA2NT...,48.0,6.4,229054.0,"$64,167,069",,6.4/10,57%,48/100




Dataset shape: (4429, 61)
Columns: ['ticker', 'date', 'title', 'distributor', 'gross', 'percent_yd', 'percent_lw', 'theaters', 'per_theater', 'total_gross', 'days_in_release', 'parent company', 'release_date', 'year', 'title_key', 'tmdb_id', 'popularity', 'weekday', 'release_month', 'release_weekday', 'is_weekend', 'imdb_id', 'original_language', 'status', 'budget', 'revenue', 'adult', 'overview', 'runtime', 'vote_average', 'vote_count', 'origin_country', 'spoken_languages', 'genre_ids', 'genre_names', 'production_company_ids', 'production_company_names', 'belongs_to_collection', 'gross_per_theater', 'omdb_title', 'omdb_year', 'omdb_rated', 'omdb_released', 'omdb_runtime', 'omdb_genre', 'omdb_director', 'omdb_writer', 'omdb_actors', 'omdb_plot', 'omdb_language', 'omdb_country', 'omdb_awards', 'omdb_poster', 'omdb_metascore', 'omdb_imdbrating', 'omdb_imdbvotes', 'omdb_boxoffice', 'omdb_production', 'omdb_rating_internetmoviedatabase', 'omdb_rating_rottentomatoes', 'omdb_rating_metacri
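
The save step relies on CSV quoting to protect text fields such as `overview` that may contain newlines, quotes, and commas. A quick way to check that assumption is a round trip through an in-memory buffer using the same `to_csv` options, sketched here on toy data (the frame and its contents are illustrative):

```python
import csv
import io

import pandas as pd

# Text values containing a newline, a comma, and double quotes
toy = pd.DataFrame({
    "title": ["A", "B"],
    "overview": ["Line one\nline two, with a comma", 'He said "hi"'],
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False, quoting=csv.QUOTE_MINIMAL,
           doublequote=True, lineterminator="\n")

# Read the CSV text back and compare the tricky column value by value
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored["overview"].tolist() == toy["overview"].tolist())
```

The same comparison could be run against `final_merged_dataset.csv` after saving, bearing in mind that dates and missing values may come back with different dtypes.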

## 9. Generate Excluded Movie Files

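Save the movies excluded by the date filter (released before 1990-01-01 or after the effective end date, 2025-10-26 in this run) for reference. Rows removed during deduplication are not written to these files.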


In [36]:
# Load the original final_df to identify excluded movies
# (before date filtering was applied)
print("=" * 80)
print("GENERATING EXCLUDED MOVIE FILES")
print("=" * 80)

# Reload final_df to get all movies before filtering
final_df_all = pd.read_csv(f'{CLEAN_DATA_PATH}/final_df.csv')
final_df_all['release_date'] = pd.to_datetime(final_df_all['release_date'], errors='coerce')

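# Merge with OMDB data. Note: omdb_cleaned was already restricted to OMDB release dates
# on or after 1990-01-01, so pre-1990 movies are likely to have empty OMDB columns here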
final_df_all_merged = final_df_all.merge(
    omdb_cleaned,
    on="imdb_id",
    how="left",
    suffixes=("", "_omdb")
)

# Define date boundaries (same as filtering step)
start_date = pd.Timestamp('1990-01-01')
end_date = pd.Timestamp('2025-10-31')
today = datetime.now()
cutoff_date = today - timedelta(days=30)
cutoff_timestamp = pd.Timestamp(cutoff_date)
effective_end_date = min(end_date, cutoff_timestamp)

# Identify movies before 1990
movies_before_1990 = final_df_all_merged[
    (final_df_all_merged['release_date'] < start_date) &
    (final_df_all_merged['release_date'].notna())
].copy()

# Identify movies after October 2025 (or last 30 days)
movies_after_2025_10 = final_df_all_merged[
    (final_df_all_merged['release_date'] > effective_end_date) &
    (final_df_all_merged['release_date'].notna())
].copy()

# Save excluded movies
if len(movies_before_1990) > 0:
    movies_before_1990_path = f"{CLEAN_DATA_PATH}/movies_before_1990.csv"
    movies_before_1990.to_csv(movies_before_1990_path, index=False, 
                              quoting=csv.QUOTE_MINIMAL, doublequote=True, lineterminator='\n')
    print(f"\n✓ Saved {len(movies_before_1990)} movies before 1990 to: {movies_before_1990_path}")
    print(f"  Date range: {movies_before_1990['release_date'].min()} to {movies_before_1990['release_date'].max()}")
else:
    print("\n✓ No movies before 1990 found")

if len(movies_after_2025_10) > 0:
    movies_after_2025_10_path = f"{CLEAN_DATA_PATH}/movies_after_2025_10.csv"
    movies_after_2025_10.to_csv(movies_after_2025_10_path, index=False,
                                quoting=csv.QUOTE_MINIMAL, doublequote=True, lineterminator='\n')
    print(f"\n✓ Saved {len(movies_after_2025_10)} movies after {effective_end_date.strftime('%Y-%m-%d')} to: {movies_after_2025_10_path}")
    print(f"  Date range: {movies_after_2025_10['release_date'].min()} to {movies_after_2025_10['release_date'].max()}")
else:
    print(f"\n✓ No movies after {effective_end_date.strftime('%Y-%m-%d')} found")

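# Note: these counts are rows. Final + excluded does not equal the original total
# because the deduplication in section 5 also removed rows.
print(f"\nSummary:")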
print(f"  Total movies in final dataset: {len(final_merged)}")
print(f"  Total movies excluded: {len(movies_before_1990) + len(movies_after_2025_10)}")
print(f"  Total movies in original dataset: {len(final_df_all_merged)}")


GENERATING EXCLUDED MOVIE FILES

✓ Saved 45 movies before 1990 to: ../data/cleaned/movies_before_1990.csv
  Date range: 1927-03-05 00:00:00 to 1989-08-08 00:00:00

✓ Saved 18 movies after 2025-10-26 to: ../data/cleaned/movies_after_2025_10.csv
  Date range: 2025-10-28 00:00:00 to 2025-11-06 00:00:00

Summary:
  Total movies in final dataset: 4429
  Total movies excluded: 63
  Total movies in original dataset: 10067
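
The three totals in this summary do not add up on their own (4429 + 63 is far below 10067) because deduplication also removed rows. Using only the counts printed in the outputs above, the reconciliation can be checked as follows:

```python
# Row counts reported in the outputs above
original_rows = 10067
before_1990 = 45
after_cutoff = 18
duplicates_removed = 5575
final_rows = 4429

excluded = before_1990 + after_cutoff
after_date_filter = original_rows - excluded

print(excluded)                                # 63
print(after_date_filter)                       # 10004
print(after_date_filter - duplicates_removed)  # 4429
```

The date filter accounts for 63 rows and deduplication for the remaining 5575, which matches the 4429 rows in the final dataset.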
