# Exploring the Data

## 1.1 Load the Datasets
- Load the Box Office Mojo and IMDB datasets into pandas DataFrames using pd.read_csv() and pd.read_sql().
- Use head(), info(), and describe() methods to get a quick overview of each dataset.

In [3]:
# Import pandas and sqlite3, which are used to load the data
import sqlite3
import pandas as pd

### Box Office Data

In [9]:
box_office_data = pd.read_csv('/Users/saniaspry/Documents/Flatiron/Phase-2/Phase-2-Project/Phase-2-Project/data/bom.movie_gross.csv.gz')

In [10]:
box_office_data.head()

|   | title | studio | domestic_gross | foreign_gross | year |
|---|-------|--------|----------------|---------------|------|
| 0 | Toy Story 3 | BV | 415000000.0 | 652000000 | 2010 |
| 1 | Alice in Wonderland (2010) | BV | 334200000.0 | 691300000 | 2010 |
| 2 | Harry Potter and the Deathly Hallows Part 1 | WB | 296000000.0 | 664300000 | 2010 |
| 3 | Inception | WB | 292600000.0 | 535700000 | 2010 |
| 4 | Shrek Forever After | P/DW | 238700000.0 | 513900000 | 2010 |


In [12]:
box_office_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


In [13]:
box_office_data.describe()

|       | domestic_gross | year |
|-------|----------------|------|
| count | 3359.0 | 3387.0 |
| mean  | 28745850.0 | 2013.958075 |
| std   | 66982500.0 | 2.478141 |
| min   | 100.0 | 2010.0 |
| 25%   | 120000.0 | 2012.0 |
| 50%   | 1400000.0 | 2014.0 |
| 75%   | 27900000.0 | 2016.0 |
| max   | 936700000.0 | 2018.0 |


### IMDB Data

In [15]:
# Connect to the database
conn = sqlite3.connect('/Users/saniaspry/Documents/Flatiron/Phase-2/Phase-2-Project/Phase-2-Project/data/im.db')

In [27]:
# View all rows of sqlite_master, which lists the tables in the database
query = "SELECT * FROM sqlite_master"


In [28]:
# Load the data into a pandas DataFrame
imdb_data = pd.read_sql(query, conn)
imdb_data

|   | type | name | tbl_name | rootpage | sql |
|---|------|------|----------|----------|-----|
| 0 | table | movie_basics | movie_basics | 2 | CREATE TABLE "movie_basics" (\n"movie_id" TEXT... |
| 1 | table | directors | directors | 3 | CREATE TABLE "directors" (\n"movie_id" TEXT,\n... |
| 2 | table | known_for | known_for | 4 | CREATE TABLE "known_for" (\n"person_id" TEXT,\... |
| 3 | table | movie_akas | movie_akas | 5 | CREATE TABLE "movie_akas" (\n"movie_id" TEXT,\... |
| 4 | table | movie_ratings | movie_ratings | 6 | CREATE TABLE "movie_ratings" (\n"movie_id" TEX... |
| 5 | table | persons | persons | 7 | CREATE TABLE "persons" (\n"person_id" TEXT,\n ... |
| 6 | table | principals | principals | 8 | CREATE TABLE "principals" (\n"movie_id" TEXT,\... |
| 7 | table | writers | writers | 9 | CREATE TABLE "writers" (\n"movie_id" TEXT,\n ... |


## 1.2 Understand the Structure
- Identify the key features in each dataset (e.g., movie title, genre, budget, revenue, etc.).
- Use value_counts() to look at the distribution of key categorical variables like genre or director.

### Box Office Data Key Features

- title (movie title)
- studio (movie production studio)
- domestic_gross (revenue from domestic box office)
- foreign_gross (revenue from foreign box office)
- year (release year)

Key Data Insights:

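- A few missing values in studio (5 of 3,387 rows) and domestic_gross (28 rows).
- Many missing values in foreign_gross (1,350 rows, roughly 40% of the dataset).
- Data types are correct except for foreign_gross, which is stored as object. It should be converted to float64 before any numeric operations; the object dtype suggests that some values contain non-numeric characters.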
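The checks above can be sketched on a small stand-in frame. This is a minimal sketch, not the notebook's actual cleaning step: the sample rows are hypothetical, and the assumption that thousands separators (commas) cause the object dtype should be verified against the real column before relying on it.

```python
import pandas as pd

# Hypothetical sample that mimics the box office columns; values are illustrative only.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "domestic_gross": [415000000.0, None, 1400000.0, 100.0],
    "foreign_gross": ["652000000", "1,131.6", None, "513900000"],
})

# Count missing values per column.
missing_counts = sample.isna().sum()
print(missing_counts)

# Assumption: commas are the only non-numeric characters. Strip them,
# then coerce anything still unparseable to NaN instead of raising.
cleaned = sample["foreign_gross"].str.replace(",", "", regex=False)
sample["foreign_gross"] = pd.to_numeric(cleaned, errors="coerce")
print(sample.dtypes)
```

Using errors="coerce" keeps the conversion from failing on unexpected strings, at the cost of silently turning them into NaN, so it is worth comparing the missing count before and after.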

### IMDB Features
- movie_basics: Contains key information about movies, such as movie_id, title, genres, runtime_minutes, and start_year.
- directors: Links directors to movies via movie_id.
- known_for: Associates people (person_id) with movies.
- movie_akas: Holds alternate titles for movies, keyed by movie_id.
- movie_ratings: Contains information about movie ratings (average_rating, num_votes).
- persons: Holds person-specific details such as name, birth_year, death_year, etc.
- principals: Contains cast and crew information for each movie.
- writers: Links writers to movies via movie_id.
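Because the sql column in sqlite_master is truncated in the output, the exact column names of each table are easier to confirm with SQLite's PRAGMA table_info. The sketch below runs against an in-memory stand-in database; demo_conn and its single table are hypothetical, and on the real data the same queries would run against conn.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for im.db; this table definition is hypothetical.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE movie_basics (movie_id TEXT, genres TEXT)")

# List only the table names rather than every column of sqlite_master.
tables = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type = 'table'", demo_conn
)
print(tables)

# PRAGMA table_info returns one row per column of the named table.
schema = pd.read_sql("PRAGMA table_info(movie_basics)", demo_conn)
print(schema[["name", "type"]])

demo_conn.close()
```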


### Data Distribution of Key Categorical Variables (Box Office Data)

In [29]:
# Distribution of studios
print(box_office_data['studio'].value_counts())

# Distribution of years
print(box_office_data['year'].value_counts())


studio
IFC           166
Uni.          147
WB            140
Fox           136
Magn.         136
             ... 
E1              1
PI              1
ELS             1
PalT            1
Synergetic      1
Name: count, Length: 257, dtype: int64
year
2015    450
2016    436
2012    400
2011    399
2014    395
2013    350
2010    328
2017    321
2018    308
Name: count, dtype: int64


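This shows which studios and years are most represented in the dataset, which helps when analyzing trends over time or by studio. The studio distribution is long-tailed: 257 distinct studios appear, and many of them have only one film.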

### Data Distribution of Key Categorical Variables (IMDB Database)


In [38]:
# Distribution of genres in movie_basics
query1 = "SELECT genres FROM movie_basics"
movie_basics = pd.read_sql(query1, conn)
print(movie_basics['genres'].value_counts())

# Distribution of directors
query2 = "SELECT person_id FROM directors"
directors = pd.read_sql(query2, conn)
print(directors['person_id'].value_counts())


genres
Documentary                   32185
Drama                         21486
Comedy                         9177
Horror                         4372
Comedy,Drama                   3519
                              ...  
Adventure,Music,Mystery           1
Documentary,Horror,Romance        1
Sport,Thriller                    1
Comedy,Sport,Western              1
Adventure,History,War             1
Name: count, Length: 1085, dtype: int64
person_id
nm6935209     238
nm2563700     190
nm1546474     185
nm3877467     180
nm3848412     144
             ... 
nm8950870       1
nm6461704       1
nm8963989       1
nm7094378       1
nm10123248      1
Name: count, Length: 109253, dtype: int64


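This shows which genre combinations are most common and which person_id values appear most often in the directors table. Two caveats apply. First, genres is stored as a comma-separated string, so "Comedy,Drama" is counted separately from "Comedy" and "Drama". Second, the director counts are row counts, so they equal film counts only if each movie and director pair appears exactly once in the table.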
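One way to count individual genres and distinct films per director is sketched below. The two sample frames are hypothetical stand-ins for the query results, and the presence of duplicate rows in directors is an assumption to check rather than a known property of the database.

```python
import pandas as pd

# Hypothetical stand-in for the genres column of movie_basics.
genres_sample = pd.DataFrame(
    {"genres": ["Documentary", "Comedy,Drama", "Drama", None, "Comedy"]}
)

# Split each comma-separated string into a list, then give every genre its own row.
genre_counts = (
    genres_sample["genres"]
    .dropna()
    .str.split(",")
    .explode()
    .value_counts()
)
print(genre_counts)

# Hypothetical stand-in for the directors table, including one duplicated row.
directors_sample = pd.DataFrame({
    "movie_id": ["tt1", "tt1", "tt2", "tt3"],
    "person_id": ["nm1", "nm1", "nm1", "nm2"],
})

# Count distinct movies per director so repeated rows are not double counted.
films_per_director = (
    directors_sample.groupby("person_id")["movie_id"]
    .nunique()
    .sort_values(ascending=False)
)
print(films_per_director)
```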

## 1.3 Identify Relationships Between Datasets
Merging Box Office Mojo with IMDB Database:
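 - Possible keys: Merge the Box Office Mojo dataset and the IMDB data on the title field from Box Office Mojo and the title field in the movie_basics table. Because different films can share a title, matching on release year as well (year against start_year) reduces false matches. The movie_id field cannot serve as a key here, because it exists only in the IMDB database and not in the Box Office Mojo data.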
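A title-based merge might look like the following sketch. The frames, the movie_id values, and the normalize_title helper are hypothetical; the IMDB-side column names follow the descriptions in this notebook and may differ in the actual schema. The normalization step is motivated by Box Office Mojo titles such as "Alice in Wonderland (2010)", which carry a year suffix.

```python
import pandas as pd

# Hypothetical stand-ins for the two sources.
bom = pd.DataFrame({
    "title": ["Toy Story 3", "Alice in Wonderland (2010)", "Inception"],
    "year": [2010, 2010, 2010],
})
imdb = pd.DataFrame({
    "movie_id": ["tt0000001", "tt0000002", "tt0000003"],  # made-up IDs
    "title": ["Toy Story 3", "Alice in Wonderland", "Inception"],
    "start_year": [2010, 2010, 2010],
})

def normalize_title(titles: pd.Series) -> pd.Series:
    """Drop a trailing '(YYYY)' suffix, trim whitespace, and lowercase."""
    return (
        titles.str.replace(r"\s*\(\d{4}\)\s*$", "", regex=True)
        .str.strip()
        .str.lower()
    )

bom["title_key"] = normalize_title(bom["title"])
imdb["title_key"] = normalize_title(imdb["title"])

# A left merge keeps every box office row; indicator=True adds a _merge
# column showing which rows found an IMDB match.
merged = bom.merge(
    imdb,
    left_on=["title_key", "year"],
    right_on=["title_key", "start_year"],
    how="left",
    indicator=True,
)
print(merged["_merge"].value_counts())
```

Inspecting the rows where _merge is "left_only" shows which titles still fail to match and whether further normalization is needed.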

Merging Tables within the IMDB Database:

- The tables within the IMDB database can be joined on movie_id to combine related information:

  - movie_basics with movie_ratings to get both the movie details and their ratings.
  - movie_basics with directors, writers, or principals to see which crew members are associated with each movie.
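The first of these joins can be done in SQL before the data reaches pandas. The sketch below uses an in-memory stand-in database; the table contents are hypothetical, and the column names (title, average_rating, num_votes) follow the descriptions in this notebook, so they should be checked against the real schema.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for im.db with hypothetical rows.
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
    CREATE TABLE movie_basics (movie_id TEXT, title TEXT, genres TEXT);
    CREATE TABLE movie_ratings (movie_id TEXT, average_rating REAL, num_votes INTEGER);
    INSERT INTO movie_basics VALUES ('tt1', 'Film A', 'Drama'), ('tt2', 'Film B', 'Comedy');
    INSERT INTO movie_ratings VALUES ('tt1', 7.5, 1200);
""")

# LEFT JOIN keeps movies that have no rating; their rating columns come back as NaN.
join_query = """
    SELECT b.movie_id, b.title, b.genres, r.average_rating, r.num_votes
    FROM movie_basics AS b
    LEFT JOIN movie_ratings AS r
        ON b.movie_id = r.movie_id
    ORDER BY b.movie_id
"""
movies_with_ratings = pd.read_sql(join_query, demo_conn)
print(movies_with_ratings)

demo_conn.close()
```

Switching to an inner join would keep only rated movies, which may be preferable when ratings are central to the analysis.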