## This is the main notebook, where we run all the scripts at once.

You all can work on the loading functions in this notebook using the two dataframes (`new_movies_df` and `new_actors_df`). No need to load from .csv files.

The cell below runs the other two notebooks, which makes the functions defined in them available in this notebook.

In [30]:
%run data_extraction.ipynb
%run data_transformation.ipynb



ModuleNotFoundError: No module named 'selenium'

ModuleNotFoundError: No module named 'selenium'
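
The `ModuleNotFoundError` output above means `selenium` is not installed in the kernel's environment, so the `%run` cells stop before their functions are defined. Below is a minimal sketch for checking dependencies up front. The module list is an assumption based on the imports and errors visible in this notebook, so adjust it to whatever the two notebooks actually import.

```python
import importlib.util

# Assumed dependency list (hypothetical; edit to match the notebooks' real imports)
REQUIRED_MODULES = ["selenium", "pandas", "numpy", "sqlalchemy", "sqlalchemy_utils"]

def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Install these before running the other notebooks:", ", ".join(missing))
else:
    print("All required modules are available.")
```

Running this before the `%run` cell gives one readable message instead of a traceback per notebook.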

In [None]:
!pip install SQLAlchemy

In [None]:
# to manage json data
import json

# for pandas dataframes
import pandas as pd

import sqlalchemy as db

import numpy as np

The data in the DataFrames (`new_movies_df` and `new_actors_df`) have been cleaned and cast to the correct data types. Let me know if anything still looks unclean or has the wrong type. As far as I know, empty values are stored as `0` (integer), `None` (the Python object, not a string), or `"NaN"` (string). The exception is the `gender` column in `new_actors_df`, where some rows store the string `"None"` instead.
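
Since several different sentinels can stand for "missing", it may help to collapse them to a single value before loading. Below is a minimal sketch, assuming the sentinel strings are exactly the ones listed above. `normalize_missing` is a hypothetical helper, not something defined in the other notebooks. Integer `0` is deliberately left alone, because it can also be a real value (for example, zero Oscars).

```python
import math

# Sentinel strings that the cleaning step may leave behind (taken from the note above)
MISSING_STRINGS = {"None", "NaN"}

def normalize_missing(value):
    """Map the string sentinels and float NaN to None; leave everything else as-is."""
    if value is None:
        return None
    if isinstance(value, str) and value.strip() in MISSING_STRINGS:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    return value

print([normalize_missing(v) for v in ["Male", "None", "NaN", None, float("nan"), 0]])
# prints ['Male', None, None, None, None, 0]
```

Applied to a column, this would look like `new_actors_df["gender"] = new_actors_df["gender"].map(normalize_missing)`.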

In [None]:
movies_df, actors_df = extract_data()
new_movies_df, new_actors_df = transform_data(movies_df, actors_df)
export_dataframe_to_csv(new_movies_df, '../resources/movies_cleaned.csv')
export_dataframe_to_csv(new_actors_df, '../resources/actors_cleaned.csv')

print(new_movies_df)
print()
print(new_actors_df)

The cell below lists the actors with missing data. Actors with `"None"` as their gender are the ones missing both their date of birth and their gender. I don't know whether you all want to remove these actors.

In [None]:
print(new_actors_df[new_actors_df["gender"] == "None"])
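
If we do decide to leave these actors out, one option is to split the DataFrame instead of deleting rows outright, so the incomplete records can still be reviewed later. A minimal sketch follows. `split_incomplete_actors` is a hypothetical helper, and the sample rows are made up to stand in for `new_actors_df`.

```python
import pandas as pd

# Made-up rows standing in for new_actors_df (illustrative only)
sample_actors_df = pd.DataFrame({
    "name": ["Actor A", "Actor B", "Actor C"],
    "date_of_birth": ["1950-01-01", None, "1980-06-15"],
    "gender": ["Male", "None", "Female"],
})

def split_incomplete_actors(actors_df):
    """Return (complete, incomplete); incomplete rows have the string "None" as gender."""
    is_incomplete = actors_df["gender"] == "None"
    return actors_df[~is_incomplete].copy(), actors_df[is_incomplete].copy()

complete_df, incomplete_df = split_incomplete_actors(sample_actors_df)
print(len(complete_df), "complete;", len(incomplete_df), "incomplete")
```

Only `complete_df` would then go to the loading step, while `incomplete_df` could be exported separately.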

Feel free to add cells below and work on the LOADING process into our database.

### Create Database in PostgreSQL 

In [36]:
!pip install sqlalchemy_utils



In [39]:
from sqlalchemy import create_engine  # Import create_engine from SQLAlchemy
from sqlalchemy_utils import create_database, database_exists  # Import utilities

# Database connection details
db_url = 'postgresql://postgres:francistan123@localhost:5432/moviesdb'

# Create engine
engine = create_engine(db_url)

# Check if the database exists
if not database_exists(engine.url):
    # Create database
    create_database(engine.url)
    print("Database 'moviesdb' created successfully!")
else:
    print("Database 'moviesdb' already exists.")

# release resources associated with engine
engine.dispose()

Database 'moviesdb' already exists.
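
The connection string in the cell above embeds the password in the notebook itself. A hedged alternative is to read the credentials from environment variables. This is only a sketch: `build_db_url` and the variable names (`MOVIESDB_USER`, `MOVIESDB_PASSWORD`, `MOVIESDB_HOST`, `MOVIESDB_PORT`, `MOVIESDB_NAME`) are hypothetical and not part of our setup yet.

```python
import os
from urllib.parse import quote_plus

def build_db_url(env=os.environ):
    """Assemble a PostgreSQL URL from environment variables (names are hypothetical)."""
    user = env.get("MOVIESDB_USER", "postgres")
    password = env["MOVIESDB_PASSWORD"]  # raises KeyError if the variable is unset
    host = env.get("MOVIESDB_HOST", "localhost")
    port = env.get("MOVIESDB_PORT", "5432")
    name = env.get("MOVIESDB_NAME", "moviesdb")
    # quote_plus escapes characters such as '@' or '/' that would break the URL
    return f"postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{name}"

# Example with a made-up password; in practice the value comes from the real environment
print(build_db_url({"MOVIESDB_PASSWORD": "p@ss/word"}))
```

The result can be passed to `create_engine` the same way as `db_url`.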


In [32]:
actors_df = pd.read_csv('actors_cleaned.csv')
movies_df = pd.read_csv('movies_cleaned.csv')
# Summary of the data
print("\nActors CSV Info:")
actors_df.info()

print("\nMovies CSV Info:")
movies_df.info()


Actors CSV Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 591 entries, 0 to 590
Data columns (total 5 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   name                   591 non-null    object
 1   date_of_birth          573 non-null    object
 2   date_of_death          186 non-null    object
 3   gender                 580 non-null    object
 4   num_of_acting_credits  591 non-null    int64 
dtypes: int64(1), object(4)
memory usage: 23.2+ KB

Movies CSV Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   title          250 non-null    object 
 1   year           250 non-null    int64  
 2   certification  248 non-null    object 
 3   release_date   250 non-null    object 
 4   runtime        250 non-null    object 
 5   genre          250 non-null    obj

### Create Tables in PostgreSQL

In [40]:
# Create connection engine

# Reuses db_url from above (user postgres, existing database moviesdb)
engine = create_engine(db_url)

conn = engine.raw_connection()

In [41]:
# Define SQL commands
commands = [
    '''CREATE TABLE IF NOT EXISTS actors (
        id SERIAL PRIMARY KEY,
        name VARCHAR NOT NULL,
        date_of_birth DATE,
        date_of_death DATE,
        gender VARCHAR,
        num_of_acting_credits INT
    );''',
    
    '''CREATE TABLE IF NOT EXISTS genre (
        id SERIAL PRIMARY KEY,
        genre_name VARCHAR UNIQUE
    );''',
    
    '''CREATE TABLE IF NOT EXISTS certification (
        id SERIAL PRIMARY KEY,
        certification_code VARCHAR UNIQUE
    );''',
    
    '''CREATE TABLE IF NOT EXISTS director (
        id SERIAL PRIMARY KEY,
        director_name VARCHAR UNIQUE
    );''',
    
    '''CREATE TABLE IF NOT EXISTS movies (
        id SERIAL PRIMARY KEY,
        title VARCHAR NOT NULL,
        year INT,
        certification_id INT REFERENCES certification(id),
        release_date DATE,
        runtime VARCHAR,
        genre_id INT REFERENCES genre(id),
        description TEXT,
        language VARCHAR,
        country VARCHAR,
        director_id INT REFERENCES director(id),
        oscars INT,
        winnings INT,
        nominations INT,
        ratings FLOAT,
        num_of_votes INT,
        revenue BIGINT,
        budget BIGINT
    );''' ]

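# Initialize connection to PostgreSQL
# Define both names up front so the cleanup in `finally` never hits a NameError
conn = None
cur = None
try:
    # Establish connection
    conn = engine.raw_connection()
    cur = conn.cursor()

    # Execute SQL commands
    for command in commands:
        cur.execute(command)
    
    # Commit changes
    conn.commit()
    print("Tables created successfully.")

except Exception as e:
    # Undo any partially applied statements before reporting the error
    if conn is not None:
        conn.rollback()
    print(f"An error occurred: {e}")

finally:
    # Close communication with server
    if cur is not None:
        cur.close()
    if conn is not None:
        conn.close()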

Tables created successfully.


In [None]:
# Load CSV files into pandas DataFrames
actors_df = pd.read_csv('actors_cleaned.csv')
movies_df = pd.read_csv('movies_cleaned.csv')

# Step 1: Insert data into `certification`, `genre`, and `director` tables

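# Extract unique certifications, genres, and directors from movies_df
# dropna() keeps missing values (e.g. the two movies without a certification)
# out of the lookup tables; those movies simply get a NULL foreign key after the merge
certifications_df = pd.DataFrame(movies_df['certification'].dropna().unique(), columns=['certification_code'])
genres_df = pd.DataFrame(movies_df['genre'].dropna().unique(), columns=['genre_name'])
directors_df = pd.DataFrame(movies_df['director'].dropna().unique(), columns=['director_name'])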

# Insert into certification table
certifications_df.to_sql('certification', engine, if_exists='append', index=False)

# Insert into genre table
genres_df.to_sql('genre', engine, if_exists='append', index=False)

# Insert into director table
directors_df.to_sql('director', engine, if_exists='append', index=False)

# Insert the actors table
actors_df.to_sql('actors', engine, if_exists='append', index=False)

# mapping of genre_id, director_id, certification_id and merge into movies_df

# Map certification_id
certifications_df = pd.read_sql("SELECT id AS certification_id, certification_code FROM certification", engine)
movies_df = movies_df.merge(certifications_df, how='left', left_on='certification', right_on='certification_code')

# Map genre_id
genres_df = pd.read_sql("SELECT id AS genre_id, genre_name FROM genre", engine)
movies_df = movies_df.merge(genres_df, how='left', left_on='genre', right_on='genre_name')

# Map director_id
directors_df = pd.read_sql("SELECT id AS director_id, director_name FROM director", engine)
movies_df = movies_df.merge(directors_df, how='left', left_on='director', right_on='director_name')

# Select relevant columns for the movies table
movies_columns = ['title', 'year', 'certification_id', 'release_date', 'runtime', 
                  'genre_id', 'description', 'language', 'country', 
                  'director_id', 'oscars', 'winnings', 'nominations', 
                  'ratings', 'num_of_votes', 'revenue', 'budget']

# Insert into the movies table
movies_df[movies_columns].to_sql('movies', engine, if_exists='append', index=False)



An error occurred: name 'text' is not defined
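
One thing to watch with the load cell above: `if_exists='append'` inserts the lookup rows again on every run, and the `UNIQUE` constraints on `genre`, `certification`, and `director` will reject them the second time. Below is a minimal sketch of an insert that can be rerun safely, using `ON CONFLICT ... DO NOTHING`, which PostgreSQL supports from version 9.5. To keep the example runnable without a server, it uses the standard-library `sqlite3` module as a stand-in (an assumption for illustration only; SQLite accepts the same clause from version 3.24). With psycopg2 the placeholder would be `%s` instead of `?`, and the `id` column would stay `SERIAL`. `insert_lookup_values` is a hypothetical helper.

```python
import sqlite3

def insert_lookup_values(conn, table, column, values):
    """Insert each value once; values that already exist are skipped silently."""
    # table/column come from our own code, never from user input,
    # so building the statement with an f-string is acceptable here
    sql = f"INSERT INTO {table} ({column}) VALUES (?) ON CONFLICT ({column}) DO NOTHING"
    cur = conn.cursor()
    try:
        cur.executemany(sql, [(v,) for v in values if v is not None])
        conn.commit()
    finally:
        cur.close()

# In-memory SQLite database standing in for moviesdb
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genre (id INTEGER PRIMARY KEY, genre_name VARCHAR UNIQUE)")

insert_lookup_values(conn, "genre", "genre_name", ["Drama", "Crime"])
insert_lookup_values(conn, "genre", "genre_name", ["Drama", "Action"])  # "Drama" is skipped

genre_names = [row[0] for row in conn.execute("SELECT genre_name FROM genre ORDER BY id")]
print(genre_names)
conn.close()
```

The `actors` and `movies` tables have no unique constraint to conflict on, so rerunning the load cell would still duplicate their rows unless the tables are emptied first.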
