# Initial EDA

## Business Understanding

What defines success for a film?
- ROI - box office success means high ticket sales relative to cost; look for high-grossing movies with low production budgets
- Ratings - popularity can be estimated from audience reviews; how strongly does popularity correlate with profitability?

What are commonalities among the most successful films?
- Genre - do specific combinations of genres perform better than the rest?
- Duration - what is the average film length, and what runtime range maximizes success?

## Data Understanding

In [None]:
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Data Sources


[**IMDb**](https://www.imdb.com)

'Data/im.db' - (8 tables)
- SQL database containing movie info and cast & crew details
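
With 8 tables in the database, it can help to look at the columns of each table as well as the table names. The sketch below is a minimal illustration on a toy in-memory database; the table definition is a stand-in and is not the actual schema of `im.db`. It uses SQLite's `PRAGMA table_info` to list a table's columns.

```python
import sqlite3
import pandas as pd

# Toy in-memory database standing in for im.db (illustrative schema, an assumption)
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute(
    "CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT, runtime_minutes REAL)"
)

# List the table names, then inspect the columns of one table
tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type='table'", demo_conn)
columns = pd.read_sql("PRAGMA table_info(movie_basics)", demo_conn)

print(tables['name'].tolist())
print(columns[['name', 'type']])
```

Against the real database, the same two queries would run on `conn` instead of `demo_conn`.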

In [None]:
# IMDb - list the tables in the database
conn = sqlite3.connect('../Data/im.db')
pd.read_sql("""
SELECT * FROM sqlite_master
WHERE type='table'
""", conn)

[**The Numbers**](https://www.the-numbers.com)

'Data/tn.movie_budgets.csv.gz' - (5782 rows x 6 cols)
- production budget, domestic/worldwide gross revenues

In [None]:
# The Numbers
pd.read_csv('../Data/tn.movie_budgets.csv.gz').info()

[**Box Office Mojo**](https://www.boxofficemojo.com)

'Data/bom.movie_gross.csv.gz' - (3387 rows x 5 cols)

- additional info on studio, gross revenue


In [None]:
# Box Office Mojo
pd.read_csv('../Data/bom.movie_gross.csv.gz').info()


[**The Movie DB**](https://www.themoviedb.org)

'Data/tmdb.movies.csv.gz' - (26517 rows x 10 cols)

- additional info on genre, language, votes/popularity


In [None]:
# The Movie DB
pd.read_csv('../Data/tmdb.movies.csv.gz').info()

[**Rotten Tomatoes**](https://www.rottentomatoes.com)

'Data/rt.movie_info.tsv.gz' - (1560 rows x 12 cols)
- synopsis, rating, runtime, etc.


'Data/rt.reviews.tsv.gz' - (54432 rows x 8 cols)
- additional info on reviews, ratings



In [None]:
# Rotten Tomatoes - movie info
pd.read_csv('../Data/rt.movie_info.tsv.gz', sep='\t' ).info()

In [None]:
# Rotten Tomatoes - reviews
pd.read_csv('../Data/rt.reviews.tsv.gz', sep='\t', encoding='latin-1').info()

### Data Cleaning

We focused on the data from IMDb and The Numbers, which together cover runtime, genre, ratings, production budget, and gross revenue.

In [None]:
# IMDb - movie_basics
# 146,144 entries

pd.read_sql("SELECT * FROM movie_basics", conn).info()

In [None]:
# IMDb - movie_ratings
# 73,856 entries

pd.read_sql("SELECT * FROM movie_ratings", conn).info()

In [None]:
# IMDb - create dataframe combining relevant data from 'movie_basics' and 'movie_ratings' tables

# TODO: check whether SELECT DISTINCT is needed to avoid duplicate rows

imdb_df = pd.read_sql("""
SELECT primary_title, original_title, runtime_minutes, genres, start_year, averagerating, numvotes
FROM movie_basics 
JOIN movie_ratings
USING (movie_id)
""", conn)

imdb_df.info()
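
Whether `SELECT DISTINCT` is needed depends on whether `movie_id` repeats in either table. One way to check is to count the joined rows per `movie_id` and keep only the ids that appear more than once. The sketch below runs on a toy in-memory database with made-up rows (an assumption for illustration, not the contents of `im.db`).

```python
import sqlite3
import pandas as pd

# Toy tables; 'tt2' is deliberately duplicated in movie_ratings
demo_conn = sqlite3.connect(':memory:')
demo_conn.executescript("""
CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT);
CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL);
INSERT INTO movie_basics VALUES ('tt1', 'Film A'), ('tt2', 'Film B');
INSERT INTO movie_ratings VALUES ('tt1', 7.1), ('tt2', 6.4), ('tt2', 6.4);
""")

# movie_ids that yield more than one row after the join
dupes = pd.read_sql("""
SELECT movie_id, COUNT(*) AS n_rows
FROM movie_basics
JOIN movie_ratings
USING (movie_id)
GROUP BY movie_id
HAVING COUNT(*) > 1
""", demo_conn)

print(dupes)
```

If the same query returns no rows on the real database, the join already yields one row per film and `DISTINCT` would change nothing.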

In [None]:
# The Numbers - https://www.the-numbers.com/glossary
# 5,782 entries

pd.read_csv('../Data/tn.movie_budgets.csv.gz').info()

In [None]:
# The Numbers - create dataframe
roi_df = pd.read_csv('../Data/tn.movie_budgets.csv.gz')

# Convert release_date column to datetime
roi_df['release_date'] = pd.to_datetime(roi_df['release_date'], errors='coerce')

# Convert budget and revenue columns to numeric
# Raw strings avoid the invalid '\$' escape; inside a character class '$' is already literal
roi_df['production_budget'] = pd.to_numeric(roi_df['production_budget'].str.replace(r'[$,]', '', regex=True), errors='coerce')
roi_df['domestic_gross'] = pd.to_numeric(roi_df['domestic_gross'].str.replace(r'[$,]', '', regex=True), errors='coerce')
roi_df['worldwide_gross'] = pd.to_numeric(roi_df['worldwide_gross'].str.replace(r'[$,]', '', regex=True), errors='coerce')

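# Create column 'ROI' defined as 'worldwide_gross' - 'production_budget'
# Note: this is net profit in dollars, not a ratio; divide by 'production_budget' for a rate of return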
roi_df['ROI'] = roi_df['worldwide_gross'] - roi_df['production_budget']

roi_df.info()
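
Since the business question is about high grosses against low budgets, a ratio is often easier to compare across films than a dollar difference. The sketch below, on made-up rows, computes both; the column names `profit` and `roi_ratio` are hypothetical and do not appear in the notebook.

```python
import pandas as pd

# Made-up rows in the same string format as the The Numbers budget columns
demo = pd.DataFrame({
    'movie': ['Film A', 'Film B', 'Film C'],
    'production_budget': ['$1,000,000', '$20,000,000', '$5,000,000'],
    'worldwide_gross': ['$4,000,000', '$10,000,000', '$5,000,000'],
})

# Strip '$' and ',' then convert to numbers
for col in ['production_budget', 'worldwide_gross']:
    demo[col] = pd.to_numeric(demo[col].str.replace(r'[$,]', '', regex=True), errors='coerce')

# Net profit in dollars, and profit per dollar of budget
demo['profit'] = demo['worldwide_gross'] - demo['production_budget']
demo['roi_ratio'] = demo['profit'] / demo['production_budget']

print(demo[['movie', 'profit', 'roi_ratio']])
```

A ratio of 3.0 means the film earned three dollars of profit per dollar of budget, while a small-budget film and a blockbuster with the same dollar profit would have very different ratios.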


In [None]:
# Merge imdb_df and roi_df, drop rows with NA (118 runtime, 8 genre)
# 2752 entries

# Inner join: keep only films whose IMDb primary_title exactly matches a The Numbers title.
# Matching on original_title as well would need a second merge and is not done here.
movie_df = pd.merge(imdb_df, roi_df, left_on='primary_title', right_on='movie', how='inner')
movie_df = movie_df.dropna()

movie_df.info()
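
Matching on title alone can pair a budget record with the wrong film when two films share a title. One possible refinement, sketched below on made-up rows, is to match on title and year together. The `release_year` column is hypothetical, and this assumes IMDb's `start_year` agrees with the release year from The Numbers, which may not hold for every film.

```python
import pandas as pd

# Made-up rows: two different films share the title 'Film A'
imdb_demo = pd.DataFrame({
    'primary_title': ['Film A', 'Film A', 'Film B'],
    'start_year': [2010, 2018, 2015],
    'runtime_minutes': [95.0, 110.0, 120.0],
})
tn_demo = pd.DataFrame({
    'movie': ['Film A', 'Film B', 'Film C'],
    'release_date': pd.to_datetime(['2018-06-01', '2015-03-10', '2012-01-20']),
    'production_budget': [1_000_000, 20_000_000, 5_000_000],
})
tn_demo['release_year'] = tn_demo['release_date'].dt.year

# Title-only match pairs the single budget record with both 'Film A' entries
by_title = pd.merge(imdb_demo, tn_demo,
                    left_on='primary_title', right_on='movie', how='inner')

# Title-and-year match keeps only the 2018 'Film A'
by_title_year = pd.merge(imdb_demo, tn_demo,
                         left_on=['primary_title', 'start_year'],
                         right_on=['movie', 'release_year'], how='inner')

print(len(by_title), len(by_title_year))
```

The trade-off is that a strict year match can drop genuine pairs whose years differ slightly between sources, so the row counts of both approaches are worth comparing before choosing one.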

In [None]:
# Break out genres into individual rows
# Make this a separate df ?

movie_df['genres'] = movie_df['genres'].str.split(',')
movie_df = movie_df.explode('genres')
movie_df
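
On the question of a separate dataframe: keeping the exploded genres in their own frame leaves the film-level frame at one row per film, which matters for per-film statistics such as runtime. A minimal sketch on made-up rows follows; the names `films` and `genre_df` are hypothetical.

```python
import pandas as pd

# Made-up film-level data: one row per film, genres as a comma-separated string
films = pd.DataFrame({
    'movie': ['Film A', 'Film B'],
    'genres': ['Action,Comedy', 'Drama'],
    'runtime_minutes': [95.0, 120.0],
})

# Build the exploded view in a new frame; 'films' itself is left unchanged
genre_df = films.assign(genres=films['genres'].str.split(',')).explode('genres')

print(len(films), len(genre_df))
```

Genre-level charts would then read from `genre_df`, while runtime and ROI summaries would read from `films`.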

## Data Analysis

In [None]:
# This counts a movie multiple times if it has more than one genre

# Create series of genre counts
genre_counts = movie_df['genres'].value_counts()

# Create bar chart 
fig, ax = plt.subplots(figsize=(12, 8))
ax.bar(genre_counts.index, genre_counts.values)
ax.set_ylabel('Number of Movies')
ax.set_title('Count of Movies by Genre')
# Rotate the existing tick labels rather than resetting them without fixed tick positions
plt.setp(ax.get_xticklabels(), rotation=45, ha='right')
plt.show()
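
Because a multi-genre film appears once per genre, the bar heights add up to more than the number of films. The sketch below, on made-up rows, makes that gap explicit and expresses each genre as a share of distinct films. It uses the title as the film identifier, which is an assumption, since titles are not guaranteed to be unique.

```python
import pandas as pd

# Made-up exploded data: one row per (film, genre) pair
genre_rows = pd.DataFrame({
    'movie': ['Film A', 'Film A', 'Film B', 'Film C', 'Film C'],
    'genres': ['Action', 'Comedy', 'Drama', 'Action', 'Drama'],
})

genre_counts = genre_rows['genres'].value_counts()
n_films = genre_rows['movie'].nunique()

# Share of distinct films tagged with each genre; shares can sum to more than 1
genre_share = genre_counts / n_films

print(genre_counts.sum(), n_films)
print(genre_share)
```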

In [None]:
# Create box plot of runtime

plt.figure(figsize=(12, 6))
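# movie_df has one row per genre after explode(), so keep one row per film via the index
runtimes = movie_df.loc[~movie_df.index.duplicated(), 'runtime_minutes']
plt.boxplot(runtimes)
plt.title('Boxplot of Film Runtimes')
plt.ylabel('Runtime in Minutes')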
plt.grid(True)
plt.show()
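
The limits that a boxplot draws can also be computed directly, which gives concrete numbers for the runtime question. The sketch below uses made-up runtimes and the 1.5 x IQR rule, which is matplotlib's default for whiskers; values outside these fences are the points drawn as outliers.

```python
import pandas as pd

# Made-up runtimes in minutes
runtimes = pd.Series([80, 90, 95, 100, 105, 110, 120, 200], name='runtime_minutes')

# Quartiles and interquartile range
q1, q3 = runtimes.quantile([0.25, 0.75])
iqr = q3 - q1

# Fences at 1.5 x IQR beyond the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = runtimes[(runtimes < lower) | (runtimes > upper)]

print(f"median={runtimes.median()}, fences=({lower}, {upper})")
print(outliers.tolist())
```

These fences describe what is typical in the sample; relating runtime to success would additionally require comparing ROI or ratings across runtime ranges.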