# Exploring the Data

## 1.1 Load the Datasets
- Load the Box Office Mojo and IMDB datasets into pandas DataFrames using pd.read_csv() and pd.read_sql().
- Use head(), info(), and describe() methods to get a quick overview of each dataset.

In [3]:
# Import sqlite3 and pandas, which are used to load the data
import sqlite3
import pandas as pd

### Box Office Data

In [9]:
box_office_data = pd.read_csv('/Users/saniaspry/Documents/Flatiron/Phase-2/Phase-2-Project/Phase-2-Project/data/bom.movie_gross.csv.gz')

In [10]:
box_office_data.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


In [12]:
box_office_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


In [13]:
box_office_data.describe()

Unnamed: 0,domestic_gross,year
count,3359.0,3387.0
mean,28745850.0,2013.958075
std,66982500.0,2.478141
min,100.0,2010.0
25%,120000.0,2012.0
50%,1400000.0,2014.0
75%,27900000.0,2016.0
max,936700000.0,2018.0


### IMDB Data

In [15]:
# Connect to the database
conn = sqlite3.connect('/Users/saniaspry/Documents/Flatiron/Phase-2/Phase-2-Project/Phase-2-Project/data/im.db')

In [27]:
# View everything in sqlite_master, including the table names
query = "SELECT * FROM sqlite_master"


In [28]:
# Load the data into a pandas DataFrame
imdb_data = pd.read_sql(query, conn)
imdb_data

Unnamed: 0,type,name,tbl_name,rootpage,sql
0,table,movie_basics,movie_basics,2,"CREATE TABLE ""movie_basics"" (\n""movie_id"" TEXT..."
1,table,directors,directors,3,"CREATE TABLE ""directors"" (\n""movie_id"" TEXT,\n..."
2,table,known_for,known_for,4,"CREATE TABLE ""known_for"" (\n""person_id"" TEXT,\..."
3,table,movie_akas,movie_akas,5,"CREATE TABLE ""movie_akas"" (\n""movie_id"" TEXT,\..."
4,table,movie_ratings,movie_ratings,6,"CREATE TABLE ""movie_ratings"" (\n""movie_id"" TEX..."
5,table,persons,persons,7,"CREATE TABLE ""persons"" (\n""person_id"" TEXT,\n ..."
6,table,principals,principals,8,"CREATE TABLE ""principals"" (\n""movie_id"" TEXT,\..."
7,table,writers,writers,9,"CREATE TABLE ""writers"" (\n""movie_id"" TEXT,\n ..."


## 1.2 Understand the Structure
- Identify the key features in each dataset (e.g., movie title, genre, budget, revenue, etc.).
- Use value_counts() to look at the distribution of key categorical variables like genre or director.

### Box Office Data Key Features

- title (movie title)
- studio (movie production studio)
- domestic_gross (revenue from domestic box office)
- foreign_gross (revenue from foreign box office)
- year (release year)

Key Data Insights:

- Some missing values in studio and domestic_gross.
- Significant missing values in foreign_gross.
- Data types are correct except for foreign_gross, which is stored as object (strings) and needs to be converted to float64 before any numeric operations.

### IMDB Features
- movie_basics: Contains key information about movies like movie_id, title, genre, runtime_minutes, start_year, etc.
- directors: Links directors to movies via movie_id.
- known_for: Associates people (person_id) with the movies they are known for.
- movie_akas: Lists alternate titles for movies, keyed by movie_id.
- movie_ratings: Contains information about movie ratings (average_rating, num_votes).
- persons: Holds person-specific details such as name, birth_year, death_year, etc.
- principals: Contains cast and crew information for each movie.
- writers: Links writers to movies via movie_id.
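
The table summaries above can be checked column by column with SQLite's `PRAGMA table_info`. The sketch below runs against an in-memory stand-in rather than im.db, and the `movie_ratings` columns it creates are assumptions based on the description above, not the verified schema.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for im.db; the table definition is illustrative only
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE movie_ratings (movie_id TEXT, average_rating REAL, num_votes INTEGER)"
)

# List the table names, then the columns of one table
tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type = 'table'", demo_conn)
columns = pd.read_sql("PRAGMA table_info(movie_ratings)", demo_conn)

print(tables["name"].tolist())
print(columns[["name", "type"]])
```

Running the same two queries against the real connection would confirm the actual column names before writing any joins.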


### Data Distribution of Key Categorical Variables (Box Office Data)

In [29]:
# Distribution of studios
print(box_office_data['studio'].value_counts())

# Distribution of years
print(box_office_data['year'].value_counts())


studio
IFC           166
Uni.          147
WB            140
Fox           136
Magn.         136
             ... 
E1              1
PI              1
ELS             1
PalT            1
Synergetic      1
Name: count, Length: 257, dtype: int64
year
2015    450
2016    436
2012    400
2011    399
2014    395
2013    350
2010    328
2017    321
2018    308
Name: count, dtype: int64


This shows which studios and years are most represented in the dataset, which helps when analyzing trends over time or by studio.
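
For trends over time, it can be easier to read the counts as shares in chronological order. A minimal sketch on made-up data (the `demo` frame is hypothetical, not the box office data):

```python
import pandas as pd

# Hypothetical toy data standing in for the 'year' column
demo = pd.DataFrame({"year": [2010, 2010, 2011, 2012, 2012, 2012]})

# normalize=True returns proportions; sort_index orders by year instead of frequency
year_share = demo["year"].value_counts(normalize=True).sort_index()
print(year_share)
```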

### Distribution of Key Categorical Variables (IMDB Database)


In [38]:
# Distribution of genres in movie_basics
query1 = "SELECT genres FROM movie_basics"
movie_basics = pd.read_sql(query1, conn)
print(movie_basics['genres'].value_counts())

# Distribution of directors
query2 = "SELECT person_id FROM directors"
directors = pd.read_sql(query2, conn)
print(directors['person_id'].value_counts())


genres
Documentary                   32185
Drama                         21486
Comedy                         9177
Horror                         4372
Comedy,Drama                   3519
                              ...  
Adventure,Music,Mystery           1
Documentary,Horror,Romance        1
Sport,Thriller                    1
Comedy,Sport,Western              1
Adventure,History,War             1
Name: count, Length: 1085, dtype: int64
person_id
nm6935209     238
nm2563700     190
nm1546474     185
nm3877467     180
nm3848412     144
             ... 
nm8950870       1
nm6461704       1
nm8963989       1
nm7094378       1
nm10123248      1
Name: count, Length: 109253, dtype: int64


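This shows which genre combinations are most common and which directors (identified by person_id) appear most often in the directors table. Note that multi-genre strings such as Comedy,Drama are counted as their own category at this stage.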

## 1.3 Identify Relationships Between Datasets
Merging Box Office Mojo with IMDB Database:
 - Possible keys: Merge the Box Office Mojo dataset with the IMDB data by matching the title field from Box Office Mojo against the title field in the movie_basics table. The Box Office Mojo data has no movie_id column, so movie_id cannot serve as the join key here; pairing title with year (year against start_year) can help separate films that share a title.
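
A title-based merge could look like the sketch below. The two small frames are stand-ins, the `movie_id` values are made up, and the `title_key` helper is a hypothetical normalization step (trimming whitespace and lowercasing), not something the notebook already does.

```python
import pandas as pd

# Stand-in for the box office data
box = pd.DataFrame({
    "title": ["Toy Story 3", "Inception "],
    "year": [2010, 2010],
    "domestic_gross": [415000000.0, 292600000.0],
})
# Stand-in for movie_basics; the movie_id values are invented
basics = pd.DataFrame({
    "movie_id": ["id_a", "id_b", "id_c"],
    "title": ["toy story 3", "Inception", "Inception"],
    "start_year": [2010, 2010, 1999],
})

def title_key(titles: pd.Series) -> pd.Series:
    """Normalize titles so trivial differences do not block a match."""
    return titles.str.strip().str.lower()

box["title_key"] = title_key(box["title"])
basics["title_key"] = title_key(basics["title"])

# Match on normalized title plus year so same-named films stay separate
merged = box.merge(
    basics,
    left_on=["title_key", "year"],
    right_on=["title_key", "start_year"],
    how="left",
    suffixes=("_bom", "_imdb"),
)
print(merged[["title_bom", "year", "movie_id"]])
```

A left merge keeps every box office row, so unmatched titles show up as missing movie_id values instead of silently disappearing.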

Merging Tables within the IMDB Database:

- The tables within the IMDB database can be merged on movie_id to combine relevant information:

  - movie_basics with movie_ratings to get both the movie details and their ratings.
  - movie_basics with directors, writers, or principals to understand crew members associated with each movie.
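
As a sketch of the first combination, the query below joins movie details to ratings on movie_id. It runs against an in-memory database with invented rows, and the column names are assumptions taken from the feature descriptions in 1.2.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for im.db with invented rows
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
CREATE TABLE movie_basics (movie_id TEXT, title TEXT, genres TEXT);
CREATE TABLE movie_ratings (movie_id TEXT, average_rating REAL, num_votes INTEGER);
INSERT INTO movie_basics VALUES ('m1', 'Film A', 'Drama'), ('m2', 'Film B', 'Comedy,Drama');
INSERT INTO movie_ratings VALUES ('m1', 7.5, 1200);
""")

# LEFT JOIN keeps movies that have no rating yet
join_query = """
SELECT b.movie_id, b.title, b.genres, r.average_rating, r.num_votes
FROM movie_basics AS b
LEFT JOIN movie_ratings AS r ON b.movie_id = r.movie_id
ORDER BY b.movie_id
"""
rated = pd.read_sql(join_query, demo_conn)
print(rated)
```

Switching to an inner JOIN would instead keep only the movies that have a rating.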

## 1.4 Outline Business Questions
- What genres perform best at the box office?
   - Analyze which genres generate the highest domestic and foreign box office revenue. 
   - Use the genre field from the movie_basics table and the domestic_gross, foreign_gross from the box office data.
- What factors influence a movie’s success (budget, genre, director, etc.)?
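   - Investigate how factors such as budget, genre, and director influence a movie's success in terms of revenue or ratings. Note that budget does not appear among the features identified above for either dataset, so it would have to come from an additional source.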
- Which movies have the highest ROI (return on investment)?
   - Calculate ROI for movies using the formula:
   - ROI = (domestic_gross + foreign_gross − budget) / budget
   - Analyze which genres or studios tend to produce the highest ROI, providing a cost-benefit perspective to movie production.
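
The ROI formula above translates directly into a column calculation. In this sketch the `budget` column and all figures are hypothetical.

```python
import pandas as pd

# Invented figures; 'budget' is a hypothetical column
films = pd.DataFrame({
    "title": ["Film A", "Film B"],
    "domestic_gross": [100.0, 30.0],
    "foreign_gross": [50.0, 10.0],
    "budget": [50.0, 80.0],
})

# ROI = (domestic_gross + foreign_gross - budget) / budget
total_gross = films["domestic_gross"] + films["foreign_gross"]
films["roi"] = (total_gross - films["budget"]) / films["budget"]
print(films[["title", "roi"]])
```

An ROI of 2.0 means the film earned back its budget plus twice that amount; a negative value means it did not recoup its budget.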

# Data Cleaning

## 2.1 Handle Missing Values
As noted in 1.2, the studio, domestic_gross, and foreign_gross columns in the box office data contain missing values.

- Use isnull() and sum() to identify columns with missing data.

Depending on the context:
- Drop rows or columns with a large amount of missing data using dropna().
- Impute missing values with appropriate statistics (mean, median, mode) using fillna().
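
The three operations can be sketched on a small made-up frame (`demo` below is hypothetical, with column names borrowed from the box office data):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in every column
demo = pd.DataFrame({
    "studio": ["BV", "WB", None],
    "domestic_gross": [415000000.0, np.nan, 2400.0],
    "foreign_gross": ["652000000", None, None],
})

# Count missing values per column
missing = demo.isnull().sum()
print(missing)

# Drop rows lacking an essential value, then fill the rest column by column
cleaned = demo.dropna(subset=["domestic_gross"])
filled = cleaned.fillna({"studio": "Unknown", "foreign_gross": 0})
print(filled)
```

Passing a dict to fillna() on the whole DataFrame assigns a different fill value to each column in one call and returns a new object, which avoids chained in-place assignment.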


1. studio (5 missing values):
- Fill in the missing values with "Unknown" rather than dropping them. These movies still have important data like gross earnings, which is crucial for analysis.
  - The studio is not the primary focus of our analysis, and removing these rows could unnecessarily reduce the size of our dataset.

In [40]:
box_office_data['studio'] = box_office_data['studio'].fillna('Unknown')
box_office_data['studio']


0               BV
1               BV
2               WB
3               WB
4             P/DW
           ...    
3382         Magn.
3383            FM
3384          Sony
3385    Synergetic
3386         Grav.
Name: studio, Length: 3387, dtype: object

2. domestic_gross (28 missing values):
- Drop rows where domestic_gross is missing.
  - Missing domestic gross values make it impossible to assess a movie’s financial performance, which is essential for our analysis. Imputing a value here (e.g., with a mean or median) could distort our analysis.

In [43]:
box_office_data = box_office_data.dropna(subset=['domestic_gross'])
box_office_data['domestic_gross']


0       415000000.0
1       334200000.0
2       296000000.0
3       292600000.0
4       238700000.0
           ...     
3382         6200.0
3383         4800.0
3384         2500.0
3385         2400.0
3386         1700.0
Name: domestic_gross, Length: 3359, dtype: float64

3. foreign_gross (1350 missing values):
- Fill in missing foreign_gross values with 0.
  - A missing foreign gross could mean that the movie was not released internationally. Setting the value to 0 keeps these movies in the analysis of total performance, which matters especially for those that performed well domestically.

In [44]:
box_office_data['foreign_gross'] = box_office_data['foreign_gross'].fillna(0)
box_office_data['foreign_gross']


0       652000000
1       691300000
2       664300000
3       535700000
4       513900000
          ...    
3382            0
3383            0
3384            0
3385            0
3386            0
Name: foreign_gross, Length: 3359, dtype: object

## 2.2 Handle Incorrect Data Types
- Convert columns to their correct data types. 
- Convert the foreign_gross column to numeric values since it's currently stored as an object.

In [47]:
# Convert 'foreign_gross' to numeric; values that cannot be parsed become NaN
box_office_data['foreign_gross'] = pd.to_numeric(box_office_data['foreign_gross'], errors='coerce')
box_office_data['foreign_gross']


0       652000000.0
1       691300000.0
2       664300000.0
3       535700000.0
4       513900000.0
           ...     
3382            0.0
3383            0.0
3384            0.0
3385            0.0
3386            0.0
Name: foreign_gross, Length: 3359, dtype: float64
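
Because errors='coerce' silently turns unparseable strings into NaN, it is worth checking whether the conversion created new missing values. The sketch below assumes, hypothetically, that some entries contain thousands separators; the values in `raw` are illustrative, not taken from the dataset.

```python
import pandas as pd

# Illustrative values: a plain string, a comma-formatted string, and a filled-in 0
raw = pd.Series(["652000000", "1,131.6", 0], name="foreign_gross")

# Direct conversion: the comma-formatted entry cannot be parsed and becomes NaN
naive = pd.to_numeric(raw, errors="coerce")
print(naive.isna().sum())

# Strip the separators first, then convert
cleaned = pd.to_numeric(
    raw.astype(str).str.replace(",", "", regex=False),
    errors="coerce",
)
print(cleaned)
```

Comparing isna().sum() before and after the conversion is a quick way to see whether any values were lost.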

## 2.3 Splitting and Normalizing the Genres Column
- Problem: Movies might be listed with multiple genres (e.g., Action, Comedy), so we need to normalize the genre data for easier analysis.
- Solution: Split the genres into separate rows so that each movie has one genre per row.

In [48]:
# Split the 'genres' column by commas
movie_basics['genres'] = movie_basics['genres'].str.split(',')

# Explode the list of genres into individual rows
movie_basics = movie_basics.explode('genres')

movie_basics

Unnamed: 0,genres
0,Action
0,Crime
0,Drama
1,Biography
1,Drama
...,...
146139,Drama
146140,Documentary
146141,Comedy
146142,


This step lets each genre be analyzed on its own: a movie with multiple genres now contributes one row to each of them, so it is counted once per genre.
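
Once genres are exploded, per-genre summaries come from a plain groupby. The sketch below uses invented films and grosses, and keeps the title and gross columns alongside genres so that each exploded row still identifies its movie.

```python
import pandas as pd

# Invented films and figures
films = pd.DataFrame({
    "title": ["Film A", "Film B"],
    "genres": ["Action,Comedy", "Comedy"],
    "domestic_gross": [100.0, 40.0],
})

# One row per (movie, genre) pair
films["genres"] = films["genres"].str.split(",")
per_genre = films.explode("genres").reset_index(drop=True)

# Number of movies and mean gross per genre
genre_summary = per_genre.groupby("genres")["domestic_gross"].agg(["count", "mean"])
print(genre_summary)
```

Since a multi-genre movie appears in several groups, per-genre totals should not be summed across genres, as that would count its revenue more than once.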

## 2.4 Checking for Duplicate or Irrelevant Data
- Ensure there are no duplicate rows in the movie_basics or directors tables.
- Remove irrelevant columns that are not needed for analysis.


In [None]:
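# Drop duplicates if any (left commented out: movie_basics currently holds only the
# exploded genres column, so deduplicating it would collapse it to one row per genre)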
# movie_basics.drop_duplicates(inplace=True)
# directors.drop_duplicates(inplace=True)
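
Before dropping anything, it helps to count how many rows are exact duplicates. A minimal sketch on a made-up directors frame that includes identifying columns:

```python
import pandas as pd

# Made-up rows; the first two are exact duplicates
directors_demo = pd.DataFrame({
    "movie_id": ["m1", "m1", "m2"],
    "person_id": ["p1", "p1", "p2"],
})

# duplicated() marks every repeat of an earlier row
n_dupes = int(directors_demo.duplicated().sum())
print(n_dupes)

# Returns a new frame, so the original stays untouched
deduped = directors_demo.drop_duplicates().reset_index(drop=True)
print(deduped)
```

The check is only meaningful when the frame contains columns that identify a record, such as movie_id.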

## 2.5 Handling Director Information
- Instead of relying on person_id, we can join the directors table with the persons table to get director names for easier interpretation.
- This will be helpful when identifying relationships between directors and movie success.

In [52]:
# Join directors with persons to get director names
query3 = """
SELECT DISTINCT d.person_id, p.primary_name
FROM directors d
JOIN persons p ON d.person_id = p.person_id
"""
director_details = pd.read_sql(query3, conn)
director_details


Unnamed: 0,person_id,primary_name
0,nm0899854,Tony Vitale
1,nm1940585,Bill Haley
2,nm0151540,Jay Chandrasekhar
3,nm0089502,Albert Pyun
4,nm2291498,Joe Baile
...,...,...
109246,nm10122247,C. Damon Adcock
109247,nm10122357,Daysi Burbano
109248,nm6711477,Bernard Lessa
109249,nm10123242,Tate Nova
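
The same join can be extended to count films per named director. The sketch below uses an in-memory database with invented rows. It counts distinct movie_id values so that repeated rows in directors do not inflate the totals; whether the real table contains such repeats is an assumption.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for im.db with invented rows (note the repeated m1/p1 row)
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
CREATE TABLE directors (movie_id TEXT, person_id TEXT);
CREATE TABLE persons (person_id TEXT, primary_name TEXT);
INSERT INTO directors VALUES ('m1', 'p1'), ('m1', 'p1'), ('m2', 'p1'), ('m3', 'p2');
INSERT INTO persons VALUES ('p1', 'Director One'), ('p2', 'Director Two');
""")

# Count each movie once per director, most prolific first
count_query = """
SELECT p.primary_name, COUNT(DISTINCT d.movie_id) AS n_movies
FROM directors AS d
JOIN persons AS p ON d.person_id = p.person_id
GROUP BY d.person_id, p.primary_name
ORDER BY n_movies DESC
"""
director_counts = pd.read_sql(count_query, demo_conn)
print(director_counts)
```

Grouping by person_id as well as primary_name keeps two different people who share a name from being merged into one row.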


# Data Transformation
Filter the datasets to include only the most relevant features for our analysis.
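
Selecting features amounts to keeping a chosen list of columns. In the sketch below, `relevant_columns` is an assumed choice for illustration, not a decision the analysis has made, and the two rows mirror the first rows shown in 1.1.

```python
import pandas as pd

# Two rows mirroring the box office preview in 1.1
box = pd.DataFrame({
    "title": ["Toy Story 3", "Inception"],
    "studio": ["BV", "WB"],
    "domestic_gross": [415000000.0, 292600000.0],
    "foreign_gross": [652000000.0, 535700000.0],
    "year": [2010, 2010],
})

# Assumed selection: keep only the columns needed for revenue analysis
relevant_columns = ["title", "domestic_gross", "foreign_gross", "year"]
box_subset = box[relevant_columns].copy()
print(box_subset.columns.tolist())
```

Calling .copy() makes the subset independent of the original frame, so later assignments do not raise chained-assignment warnings.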