# Initial EDA

## Business Understanding

What defines success for a film?
- ROI - box office success means high ticket revenue relative to costs; look for high-grossing movies with low production budgets
- Ratings - popularity can be estimated from a sample of audience reviews; how strongly does popularity correlate with profitability?

What are commonalities among the most successful films?
- Genre - do specific combinations of genres perform better than the rest?
- Duration - what is the average film length, and what range of runtimes is associated with the most success?

## Data Understanding

In [1]:
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Data Sources


[**IMDb**](https://www.imdb.com)

'Data/im.db' - (8 tables, plus a `cleaned_movies` table that appears in the listing below)
- SQL database containing movie info and cast & crew details

In [2]:
# IMDb
conn = sqlite3.connect('../Data/im.db')
pd.read_sql("""
SELECT * FROM sqlite_master
WHERE type='table'
""", conn)

Unnamed: 0,type,name,tbl_name,rootpage,sql
0,table,movie_basics,movie_basics,2,"CREATE TABLE ""movie_basics"" (\n""movie_id"" TEXT..."
1,table,directors,directors,3,"CREATE TABLE ""directors"" (\n""movie_id"" TEXT,\n..."
2,table,known_for,known_for,4,"CREATE TABLE ""known_for"" (\n""person_id"" TEXT,\..."
3,table,movie_akas,movie_akas,5,"CREATE TABLE ""movie_akas"" (\n""movie_id"" TEXT,\..."
4,table,movie_ratings,movie_ratings,6,"CREATE TABLE ""movie_ratings"" (\n""movie_id"" TEX..."
5,table,persons,persons,7,"CREATE TABLE ""persons"" (\n""person_id"" TEXT,\n ..."
6,table,principals,principals,8,"CREATE TABLE ""principals"" (\n""movie_id"" TEXT,\..."
7,table,writers,writers,9,"CREATE TABLE ""writers"" (\n""movie_id"" TEXT,\n ..."
8,table,cleaned_movies,cleaned_movies,41369,"CREATE TABLE ""cleaned_movies"" (\n""index"" INTEG..."
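
The `sql` column above is truncated, so the column names are hard to read from the listing. A minimal sketch of one way to summarize each table's columns and row count follows. It runs against a hypothetical in-memory database (`demo_conn`) rather than `im.db`, and `summarize_tables` is an illustrative helper, not part of the original notebook.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory stand-in for im.db so the sketch runs on its own.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT, start_year INTEGER)")
demo_conn.execute("CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL, numvotes INTEGER)")
demo_conn.execute("INSERT INTO movie_basics VALUES ('tt001', 'Example Film', 2015)")

def summarize_tables(connection):
    """Return one row per table with its column names and row count."""
    tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type='table'", connection)
    rows = []
    for name in tables["name"]:
        # PRAGMA table_info lists one row per column of the named table.
        columns = pd.read_sql(f'PRAGMA table_info("{name}")', connection)["name"].tolist()
        n_rows = pd.read_sql(f'SELECT COUNT(*) AS n FROM "{name}"', connection)["n"].iloc[0]
        rows.append({"table": name, "columns": ", ".join(columns), "rows": int(n_rows)})
    return pd.DataFrame(rows)

table_summary = summarize_tables(demo_conn)
print(table_summary)
```

Passing the notebook's `conn` instead of `demo_conn` should give the same kind of summary for the real database.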


[**The Numbers**](https://www.the-numbers.com)

'Data/tn.movie_budgets.csv.gz' - (5782 rows x 6 cols)
- production budget, domestic/worldwide gross revenues

In [3]:
# The Numbers
pd.read_csv('../Data/tn.movie_budgets.csv.gz').info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 5782 non-null   int64 
 1   release_date       5782 non-null   object
 2   movie              5782 non-null   object
 3   production_budget  5782 non-null   object
 4   domestic_gross     5782 non-null   object
 5   worldwide_gross    5782 non-null   object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB


[**Box Office Mojo**](https://www.boxofficemojo.com)

'Data/bom.movie_gross.csv.gz' - (3387 rows x 5 cols)

- additional info on studio, gross revenue


In [4]:
# Box Office Mojo
pd.read_csv('../Data/bom.movie_gross.csv.gz').info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


[**The Movie DB**](https://www.themoviedb.org)

'Data/tmdb.movies.csv.gz' - (26517 rows x 10 cols)

- additional info on genre, language, votes/popularity


In [5]:
# The Movie DB
pd.read_csv('../Data/tmdb.movies.csv.gz').info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26517 entries, 0 to 26516
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         26517 non-null  int64  
 1   genre_ids          26517 non-null  object 
 2   id                 26517 non-null  int64  
 3   original_language  26517 non-null  object 
 4   original_title     26517 non-null  object 
 5   popularity         26517 non-null  float64
 6   release_date       26517 non-null  object 
 7   title              26517 non-null  object 
 8   vote_average       26517 non-null  float64
 9   vote_count         26517 non-null  int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 2.0+ MB


[**Rotten Tomatoes**](https://www.rottentomatoes.com)

'Data/rt.movie_info.tsv.gz' - (1560 rows x 12 cols)
- synopsis, rating, runtime, etc.


'Data/rt.reviews.tsv.gz' - (54432 rows x 8 cols)
- additional info on reviews, ratings



In [6]:
# Rotten Tomatoes - movie info
pd.read_csv('../Data/rt.movie_info.tsv.gz', sep='\t' ).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1560 entries, 0 to 1559
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            1560 non-null   int64 
 1   synopsis      1498 non-null   object
 2   rating        1557 non-null   object
 3   genre         1552 non-null   object
 4   director      1361 non-null   object
 5   writer        1111 non-null   object
 6   theater_date  1201 non-null   object
 7   dvd_date      1201 non-null   object
 8   currency      340 non-null    object
 9   box_office    340 non-null    object
 10  runtime       1530 non-null   object
 11  studio        494 non-null    object
dtypes: int64(1), object(11)
memory usage: 146.4+ KB


In [7]:
# Rotten Tomatoes - reviews
pd.read_csv('../Data/rt.reviews.tsv.gz', sep='\t', encoding='latin-1').info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54432 entries, 0 to 54431
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          54432 non-null  int64 
 1   review      48869 non-null  object
 2   rating      40915 non-null  object
 3   fresh       54432 non-null  object
 4   critic      51710 non-null  object
 5   top_critic  54432 non-null  int64 
 6   publisher   54123 non-null  object
 7   date        54432 non-null  object
dtypes: int64(2), object(6)
memory usage: 3.3+ MB


### Data Cleaning

We focused on the data from IMDb and The Numbers.

In [8]:
# IMDb - movie_basics
# 146,144 entries

pd.read_sql("SELECT * FROM movie_basics", conn).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   movie_id         146144 non-null  object 
 1   primary_title    146144 non-null  object 
 2   original_title   146123 non-null  object 
 3   start_year       146144 non-null  int64  
 4   runtime_minutes  114405 non-null  float64
 5   genres           140736 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 6.7+ MB


In [9]:
# IMDb - movie_ratings
# 73,856 entries

pd.read_sql("SELECT * FROM movie_ratings", conn).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   movie_id       73856 non-null  object 
 1   averagerating  73856 non-null  float64
 2   numvotes       73856 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 1.7+ MB


In [10]:
# IMDb - create dataframe combining relevant data from 'movie_basics' and 'movie_ratings' tables

# SELECT DISTINCT ?

imdb_df = pd.read_sql("""
SELECT primary_title, original_title, runtime_minutes, genres, start_year, averagerating, numvotes
FROM movie_basics 
JOIN movie_ratings
USING (movie_id)
""", conn)

imdb_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   primary_title    73856 non-null  object 
 1   original_title   73856 non-null  object 
 2   runtime_minutes  66236 non-null  float64
 3   genres           73052 non-null  object 
 4   start_year       73856 non-null  int64  
 5   averagerating    73856 non-null  float64
 6   numvotes         73856 non-null  int64  
dtypes: float64(2), int64(2), object(3)
memory usage: 3.9+ MB
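
Since the later merge is keyed on title rather than `movie_id`, repeated titles matter more than repeated rows. The sketch below shows one way to find repeated titles and, as an assumed rule only, keep the most-voted entry per title. `imdb_demo` is made-up data standing in for `imdb_df`.

```python
import pandas as pd

# Toy stand-in for imdb_df; the rows are made up for illustration.
imdb_demo = pd.DataFrame({
    "primary_title": ["Alpha", "Alpha", "Beta"],
    "start_year": [2012, 2016, 2014],
    "numvotes": [100, 2500, 40],
})

# Titles that occur more than once would fan out in a title-based merge.
repeated = imdb_demo[imdb_demo.duplicated("primary_title", keep=False)]
print(repeated)

# One possible rule (an assumption): keep the most-voted entry per title.
deduped = (
    imdb_demo.sort_values("numvotes", ascending=False)
             .drop_duplicates("primary_title")
             .sort_index()
)
print(deduped)
```

Whether the most-voted entry is the right one to keep depends on the question being asked; matching on release year as well is another option.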


In [11]:
# The Numbers - https://www.the-numbers.com/glossary
# 5,782 entries

pd.read_csv('../Data/tn.movie_budgets.csv.gz').info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 5782 non-null   int64 
 1   release_date       5782 non-null   object
 2   movie              5782 non-null   object
 3   production_budget  5782 non-null   object
 4   domestic_gross     5782 non-null   object
 5   worldwide_gross    5782 non-null   object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB


In [16]:
roi_df = pd.read_csv('../Data/tn.movie_budgets.csv.gz')

def convert_revenue_columns(df, columns):
    # Strip '$' and ',' from each column, then convert to numeric (unparseable values become NaN)
    for column in columns:
        df[column] = pd.to_numeric(df[column].str.replace(r'[\$,]', '', regex=True), errors='coerce')
    return df

convert_columns = ['production_budget', 'domestic_gross', 'worldwide_gross']
roi_df = convert_revenue_columns(roi_df, convert_columns)

In [17]:
roi_df

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,425000000,760507625,2776345279
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,410600000,241063875,1045663875
2,3,"Jun 7, 2019",Dark Phoenix,350000000,42762350,149762350
3,4,"May 1, 2015",Avengers: Age of Ultron,330600000,459005868,1403013963
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,317000000,620181382,1316721747
...,...,...,...,...,...,...
5777,78,"Dec 31, 2018",Red 11,7000,0,0
5778,79,"Apr 2, 1999",Following,6000,48482,240495
5779,80,"Jul 13, 2005",Return to the Land of Wonders,5000,1338,1338
5780,81,"Sep 29, 2015",A Plague So Pleasant,1400,0,0
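
Because the conversion uses `errors='coerce'`, any value that fails to parse silently becomes NaN. A small sketch of how that could be checked is below. The data in `raw` is invented and `to_dollars` is a hypothetical helper that mirrors the cleaning step above.

```python
import pandas as pd

# Toy stand-in for the raw budget columns; values are illustrative only.
raw = pd.DataFrame({
    "production_budget": ["$1,000", "$2,500,000", "n/a"],
    "worldwide_gross": ["$0", "$3,000,000", "$10"],
})

def to_dollars(series):
    """Strip '$' and ',' then convert to numbers; unparseable values become NaN."""
    return pd.to_numeric(series.str.replace(r"[\$,]", "", regex=True), errors="coerce")

converted = raw.apply(to_dollars)

# Count values that were present in the raw data but failed to parse.
coerced = (converted.isna() & raw.notna()).sum()
print(coerced)
```

A nonzero count for a column would suggest inspecting the raw strings before trusting the numeric values.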


In [18]:
roi_df['release_date'] = pd.to_datetime(roi_df['release_date'], errors='coerce')

In [19]:
# Create column 'ROI', defined here as net profit in dollars ('worldwide_gross' - 'production_budget') rather than a ratio
roi_df['ROI'] = roi_df['worldwide_gross'] - roi_df['production_budget']

roi_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   id                 5782 non-null   int64         
 1   release_date       5782 non-null   datetime64[ns]
 2   movie              5782 non-null   object        
 3   production_budget  5782 non-null   int64         
 4   domestic_gross     5782 non-null   int64         
 5   worldwide_gross    5782 non-null   int64         
 6   ROI                5782 non-null   int64         
dtypes: datetime64[ns](1), int64(5), object(1)
memory usage: 316.3+ KB
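
The business question is about high gross relative to a low budget, and a dollar difference favors large productions. As a hedged sketch, a ratio form of ROI could be computed alongside the difference. The rows in `demo` are made up, and the `profit` and `roi_ratio` column names are assumptions, not columns from the notebook.

```python
import numpy as np
import pandas as pd

# Toy rows (made up) with the same column names as roi_df.
demo = pd.DataFrame({
    "movie": ["Low budget hit", "Blockbuster", "Flop"],
    "production_budget": [10_000, 200_000_000, 50_000_000],
    "worldwide_gross": [250_000, 600_000_000, 5_000_000],
})

# Net profit in dollars: what the notebook stores in its 'ROI' column.
demo["profit"] = demo["worldwide_gross"] - demo["production_budget"]

# Ratio form (an assumption about what is wanted): profit per dollar of budget.
# A zero budget is mapped to NaN to avoid dividing by zero.
demo["roi_ratio"] = demo["profit"] / demo["production_budget"].replace(0, np.nan)

print(demo.sort_values("roi_ratio", ascending=False))
```

In this toy data the blockbuster has the largest profit, while the low-budget film has by far the largest ratio.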


In [20]:
roi_df.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross,ROI
0,1,2009-12-18,Avatar,425000000,760507625,2776345279,2351345279
1,2,2011-05-20,Pirates of the Caribbean: On Stranger Tides,410600000,241063875,1045663875,635063875
2,3,2019-06-07,Dark Phoenix,350000000,42762350,149762350,-200237650
3,4,2015-05-01,Avengers: Age of Ultron,330600000,459005868,1403013963,1072413963
4,5,2017-12-15,Star Wars Ep. VIII: The Last Jedi,317000000,620181382,1316721747,999721747


In [21]:
roi_df.tail()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross,ROI
5777,78,2018-12-31,Red 11,7000,0,0,-7000
5778,79,1999-04-02,Following,6000,48482,240495,234495
5779,80,2005-07-13,Return to the Land of Wonders,5000,1338,1338,-3662
5780,81,2015-09-29,A Plague So Pleasant,1400,0,0,-1400
5781,82,2005-08-05,My Date With Drew,1100,181041,181041,179941


In [None]:
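# Merge imdb_df and roi_df on title, then drop rows with NA (118 runtime, 8 genre)
# 2752 entries

# Outer merge on primary_title/movie; the filter below then keeps only matched rows
movie_df = pd.merge(imdb_df, roi_df, left_on='primary_title', right_on='movie', how='outer')
# Note: because the merge key is primary_title, the original_title condition cannot add rows
movie_df = movie_df[(movie_df['movie'] == movie_df['primary_title']) | (movie_df['movie'] == movie_df['original_title'])]
movie_df = movie_df.dropna()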

movie_df.info()
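
A title-only merge can pair different films that share a name, and it only matches on `primary_title`. The sketch below shows one possible alternative: match on either title and require the same year. All data is invented, and the assumption that IMDb `start_year` equals the release year in The Numbers will not always hold.

```python
import pandas as pd

# Toy stand-ins for imdb_df and roi_df; rows are made up for illustration.
imdb_demo = pd.DataFrame({
    "primary_title": ["Alpha", "Alpha", "Gamma"],
    "original_title": ["Alpha", "Alpha", "Gama"],
    "start_year": [2012, 2016, 2014],
})
budget_demo = pd.DataFrame({
    "movie": ["Alpha", "Gama", "Delta"],
    "release_date": pd.to_datetime(["2016-03-01", "2014-07-04", "2010-01-01"]),
})
budget_demo["release_year"] = budget_demo["release_date"].dt.year

# Long form: one row per (film, candidate title) so either title can match.
imdb_long = (
    imdb_demo.reset_index()
             .melt(id_vars=["index", "start_year"],
                   value_vars=["primary_title", "original_title"],
                   var_name="title_kind", value_name="match_title")
             .drop_duplicates(subset=["index", "match_title"])
)

# Inner merge on title and year keeps only films found in both sources.
matched = imdb_long.merge(
    budget_demo,
    left_on=["match_title", "start_year"],
    right_on=["movie", "release_year"],
    how="inner",
)
print(matched[["index", "match_title", "start_year", "movie"]])
```

Here the 2012 "Alpha" is not paired with the 2016 budget row, and "Gamma" is matched through its original title.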

In [None]:
# Break out genres into individual rows
# Make this a separate df ?

movie_df['genres'] = movie_df['genres'].str.split(',')
movie_df = movie_df.explode('genres')
movie_df
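
On the question of a separate dataframe: exploding in place repeats each film once per genre, which would weight multi-genre films more heavily in per-film statistics such as runtime. A minimal sketch of keeping the long genre table separate is below, using made-up data and hypothetical names (`films`, `film_genres`).

```python
import pandas as pd

# Toy film table; rows are invented for illustration.
films = pd.DataFrame({
    "movie": ["Alpha", "Beta"],
    "genres": ["Action,Drama", "Comedy"],
    "runtime_minutes": [120.0, 95.0],
})

# Leave `films` at one row per film; build the long genre table separately.
film_genres = (
    films.assign(genre=films["genres"].str.split(","))
         .explode("genre")[["movie", "genre"]]
         .reset_index(drop=True)
)

print(film_genres)
print(films["runtime_minutes"].mean())  # still computed over one row per film
```

The two tables can be joined back on `movie` whenever a per-genre view is needed.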

## Data Analysis

In [None]:
# This counts a movie multiple times if it has more than one genre

# Create series of genre counts
genre_counts = movie_df['genres'].value_counts()

# Create bar chart 
fig, ax = plt.subplots(figsize=(12, 8))
ax.bar(genre_counts.index, genre_counts.values)
ax.set_ylabel('Number of Movies')
ax.set_title('Count of Movies by Genre')
plt.setp(ax.get_xticklabels(), rotation=45, ha='right')
plt.show()
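
The chart above counts single genres. To look at combinations of genres instead, one option is to count the full genre string per film. This sketch assumes a column that still holds the unexploded, comma-separated genres; the data and names are illustrative.

```python
import pandas as pd

# Toy film table with unexploded genre strings; rows are made up.
films = pd.DataFrame({
    "movie": ["Alpha", "Beta", "Gamma", "Delta"],
    "genres": ["Drama,Action", "Action,Drama", "Comedy", "Drama"],
})

# Sort genres within each film so 'Drama,Action' and 'Action,Drama' count as one combination.
combo = films["genres"].str.split(",").apply(lambda parts: ",".join(sorted(parts)))
combo_counts = combo.value_counts()

print(combo_counts)
```

Each film is counted exactly once here, so the counts sum to the number of films.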

In [None]:
# Create box plot of runtime

plt.figure(figsize=(12, 6))
plt.boxplot(movie_df['runtime_minutes'], vert=True)
plt.title('Boxplot of Film Runtimes')
plt.ylabel('Runtime in Minutes')
plt.grid(True)
plt.show()
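
The box plot's whiskers can also be computed as numbers, which gives a first rough answer to the question about limits on runtime. The sketch below applies the 1.5 x IQR rule, which is matplotlib's default whisker setting, to an invented series of runtimes.

```python
import pandas as pd

# Made-up runtimes in minutes, standing in for movie_df['runtime_minutes'].
runtimes = pd.Series([80, 90, 95, 100, 105, 110, 120, 240], name="runtime_minutes")

# Quartiles and interquartile range.
q1, q3 = runtimes.quantile([0.25, 0.75])
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are drawn as outliers in a default box plot.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = runtimes[(runtimes < lower) | (runtimes > upper)]

print(f"typical range: {lower:.1f} to {upper:.1f} minutes")
print(outliers)
```

These bounds describe how runtimes are distributed, not which runtimes perform best; relating runtime to success would need a separate comparison against ROI or ratings.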