# Movie Data ETL Pipeline - Load

With the data extracted and transformed in the two previous notebooks, this notebook focuses on loading the data into a PostgreSQL database. Before loading, we will also fold some of the ratings data into the combined movie data.

### Dependencies and data

In [1]:
# Dependencies
import os
import datetime as dt
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from sqlalchemy import create_engine
from config import PSQL_PW
%matplotlib inline

In [2]:
# Plot settings
plot_styles = mpl.style.available
mpl.style.use(plot_styles[0])
mpl.rcParams['figure.figsize'] = (12, 4)
mpl.rcParams['font.size'] = 15

In [3]:
# Path to data directory
data_path = os.path.join('..', 'data')

# Paths to data files
movies_path = os.path.join(data_path, 'movies.pkl')
ratings_path = os.path.join(data_path, 'raw', 'ratings.csv')
print(movies_path)
print(ratings_path)

../data/movies.pkl
../data/raw/ratings.csv


In [4]:
# Combined movie data
movies_df = pd.read_pickle(movies_path)
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5982 entries, 0 to 5982
Data columns (total 30 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   movie_id              5982 non-null   int64         
 1   imdb_id               5982 non-null   object        
 2   imdb_link             5982 non-null   object        
 3   url                   5982 non-null   object        
 4   poster_path           5981 non-null   object        
 5   title                 5982 non-null   object        
 6   overview              5977 non-null   object        
 7   release_date          5982 non-null   datetime64[ns]
 8   year                  5982 non-null   int64         
 9   runtime               5982 non-null   float64       
 10  budget                4600 non-null   float64       
 11  revenue               5149 non-null   float64       
 12  genres                5982 non-null   object        
 13  country           

In [5]:
# Rating data
ratings_df = pd.read_csv(ratings_path)
ratings_df.info(show_counts=True)  # `null_counts` was renamed to `show_counts` in newer pandas

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26024289 entries, 0 to 26024288
Data columns (total 4 columns):
 #   Column     Non-Null Count     Dtype  
---  ------     --------------     -----  
 0   userId     26024289 non-null  int64  
 1   movieId    26024289 non-null  int64  
 2   rating     26024289 non-null  float64
 3   timestamp  26024289 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 794.2 MB


In [6]:
# Convert timestamp to datetime type
ratings_df['timestamp'] = pd.to_datetime(ratings_df['timestamp'], unit='s')
ratings_df.head(2)

|   | userId | movieId | rating | timestamp |
|---|--------|---------|--------|-----------|
| 0 | 1 | 110 | 1.0 | 2015-03-09 22:52:09 |
| 1 | 1 | 147 | 4.5 | 2015-03-09 23:07:15 |


### Aggregate ratings by movie

With over 26 million rows in the ratings data, it helps to summarize it with an aggregate and include that summary in the combined movie data. We will create a pivot table that counts how many times each movie received each rating value (0.5 to 5.0 in half-point steps).

In [7]:
# Count ratings by movie
ratings_count_df = pd.pivot_table(ratings_df, index='movieId', columns='rating',
                                  values='timestamp', aggfunc='count', fill_value=0).reset_index()
ratings_count_df.columns.name = None
ratings_count_df.head(2)

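|   | movieId | 0.5 | 1.0 | 1.5 | 2.0 | 2.5 | 3.0 | 3.5 | 4.0 | 4.5 | 5.0 |
|---|---------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 0 | 1 | 441 | 804 | 438 | 2083 | 1584 | 11577 | 5741 | 22020 | 5325 | 15995 |
| 1 | 2 | 263 | 797 | 525 | 2479 | 1810 | 8510 | 2916 | 6035 | 690 | 2035 |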
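
As a cross-check on the pivot step, the same per-movie counts can be sketched with a `groupby` followed by `unstack`. This is an alternative formulation, not the notebook's method, and the toy data below is made up purely for illustration.

```python
import pandas as pd

# Toy ratings data (made-up values, for illustration only)
toy_ratings = pd.DataFrame({
    'userId':  [1, 2, 3, 1, 2],
    'movieId': [10, 10, 10, 20, 20],
    'rating':  [4.0, 4.0, 5.0, 3.5, 4.0],
})

# Count rows per (movieId, rating) pair, then spread the rating values across columns;
# fill_value=0 covers movies that never received a given rating
toy_counts = (
    toy_ratings.groupby(['movieId', 'rating'])
    .size()
    .unstack(fill_value=0)
    .reset_index()
)
toy_counts.columns.name = None
print(toy_counts)
```

Because `size()` counts rows directly, this variant does not need a `values` column such as `timestamp` to count on.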


In [8]:
# Add prefix to col names
ratings_count_df.columns = ['movie_id'] + ['rating_' + str(rating) for rating in ratings_count_df.columns[1:]]
ratings_count_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45115 entries, 0 to 45114
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   movie_id    45115 non-null  int64
 1   rating_0.5  45115 non-null  int64
 2   rating_1.0  45115 non-null  int64
 3   rating_1.5  45115 non-null  int64
 4   rating_2.0  45115 non-null  int64
 5   rating_2.5  45115 non-null  int64
 6   rating_3.0  45115 non-null  int64
 7   rating_3.5  45115 non-null  int64
 8   rating_4.0  45115 non-null  int64
 9   rating_4.5  45115 non-null  int64
 10  rating_5.0  45115 non-null  int64
dtypes: int64(11)
memory usage: 3.8 MB


### Combine movie and rating data

In [9]:
# Merge aggregate rating data with movie data
df = pd.merge(movies_df, ratings_count_df, on='movie_id', how='left')
df.head(2)

|   | movie_id | imdb_id | imdb_link | url | poster_path | title | overview | release_date | year | runtime | ... | rating_0.5 | rating_1.0 | rating_1.5 | rating_2.0 | rating_2.5 | rating_3.0 | rating_3.5 | rating_4.0 | rating_4.5 | rating_5.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9548 | tt0098987 | https://www.imdb.com/title/tt0098987/ | https://en.wikipedia.org/wiki/The_Adventures_o... | /yLeX2QLkHeRlYQRcbU8BKgMaYYD.jpg | The Adventures of Ford Fairlane | Ford "Mr. Rock n' Roll Detective" Fairlane is ... | 1990-07-11 | 1990 | 104.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 25501 | tt0098994 | https://www.imdb.com/title/tt0098994/ | https://en.wikipedia.org/wiki/After_Dark,_My_S... | /3hjcHNtWn9T6jVGXgNXyCsMWBdj.jpg | After Dark, My Sweet | The intriguing relationship between three desp... | 1990-08-24 | 1990 | 114.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |


In [10]:
# Fill missing rating counts with 0
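# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which newer pandas versions warn about and may not apply
rating_cols = df.columns[-10:]
df[rating_cols] = df[rating_cols].fillna(0)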
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5982 entries, 0 to 5981
Data columns (total 40 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   movie_id              5982 non-null   int64         
 1   imdb_id               5982 non-null   object        
 2   imdb_link             5982 non-null   object        
 3   url                   5982 non-null   object        
 4   poster_path           5981 non-null   object        
 5   title                 5982 non-null   object        
 6   overview              5977 non-null   object        
 7   release_date          5982 non-null   datetime64[ns]
 8   year                  5982 non-null   int64         
 9   runtime               5982 non-null   float64       
 10  budget                4600 non-null   float64       
 11  revenue               5149 non-null   float64       
 12  genres                5982 non-null   object        
 13  country           
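
The left merge keeps every movie, so movies without any ratings get `NaN` in the rating columns, which is why the counts show up as floats (`0.0`) after filling. A minimal sketch of that behaviour on made-up toy tables follows; the cast back to integers at the end is an optional extra step assumed here, not something the notebook does.

```python
import pandas as pd

# Toy stand-ins for the movie and rating-count tables (made-up values)
toy_movies = pd.DataFrame({'movie_id': [1, 2, 3], 'title': ['A', 'B', 'C']})
toy_counts = pd.DataFrame({'movie_id': [1, 3], 'rating_5.0': [7, 2]})

# Left merge keeps all three movies; movie 2 has no match, so its count is NaN
# and the whole column is upcast to float
toy_merged = pd.merge(toy_movies, toy_counts, on='movie_id', how='left')

# Fill the gaps with 0, then (optionally) cast the counts back to integers
rating_cols = [col for col in toy_merged.columns if col.startswith('rating_')]
toy_merged[rating_cols] = toy_merged[rating_cols].fillna(0)
toy_merged = toy_merged.astype({col: 'int64' for col in rating_cols})
print(toy_merged.dtypes)
```

Selecting the columns by their `rating_` prefix, rather than by position, keeps the fill from touching the wrong columns if the column order ever changes.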

### Connect to PostgreSQL database

An empty PostgreSQL database named `movie_db` was created in pgAdmin. We connect to it here so that the movie data can be loaded.

In [11]:
# Database params
user = 'postgres'
pw = PSQL_PW
loc = '127.0.0.1'
port = '5432'
db = 'movie_db'

# Connection string format: "postgresql://[user]:[password]@[location]:[port]/[database]"
db_string = f'postgresql://{user}:{pw}@{loc}:{port}/{db}'

# Create engine
engine = create_engine(db_string)
engine

Engine(postgresql://postgres:***@127.0.0.1:5432/movie_db)
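
One caveat with building the connection string by hand: a password containing characters such as `@`, `/` or `:` can be misread as part of the URL structure. A minimal sketch of percent-encoding the password with the standard library is shown below. The credentials are hypothetical placeholders, not the values from `config`.

```python
from urllib.parse import quote_plus

# Hypothetical credentials for illustration; the notebook imports PSQL_PW from config
example_user = 'postgres'
example_pw = 'p@ss/word:1'

# Percent-encode the password so '@', '/' and ':' are not read as URL delimiters
safe_pw = quote_plus(example_pw)
example_db_string = f'postgresql://{example_user}:{safe_pw}@127.0.0.1:5432/movie_db'
print(example_db_string)
```

If the password contains only letters and digits, the encoded value is identical to the original, so applying this step unconditionally is harmless.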

### Load movie data into database

In [12]:
# Create table for movie data
df.to_sql('movies', engine, if_exists='replace')
pd.read_sql('SELECT * FROM movies', engine).head(2)

|   | index | movie_id | imdb_id | imdb_link | url | poster_path | title | overview | release_date | year | ... | rating_0.5 | rating_1.0 | rating_1.5 | rating_2.0 | rating_2.5 | rating_3.0 | rating_3.5 | rating_4.0 | rating_4.5 | rating_5.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 9548 | tt0098987 | https://www.imdb.com/title/tt0098987/ | https://en.wikipedia.org/wiki/The_Adventures_o... | /yLeX2QLkHeRlYQRcbU8BKgMaYYD.jpg | The Adventures of Ford Fairlane | Ford "Mr. Rock n' Roll Detective" Fairlane is ... | 1990-07-11 | 1990 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1 | 25501 | tt0098994 | https://www.imdb.com/title/tt0098994/ | https://en.wikipedia.org/wiki/After_Dark,_My_S... | /3hjcHNtWn9T6jVGXgNXyCsMWBdj.jpg | After Dark, My Sweet | The intriguing relationship between three desp... | 1990-08-24 | 1990 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


### Load rating data into database

The ratings data itself will also be loaded into the database. However, it contains ratings for many movies that are not in the movie data. There is no point in keeping those ratings, since they would only make the load take longer.

We will reduce the rating data to ratings of movies that appear in the movie data. Even after this reduction, the table can still have a large number of rows. To handle this, the reduced data will be saved to disk and then read back in chunks, with each chunk loaded into the database one at a time.

In [13]:
# Filter the data to only the movies in the movie data
ratings_reduced_df = ratings_df[ratings_df['movieId'].isin(df['movie_id'].values)]
ratings_reduced_df.shape

(4265986, 4)

In [14]:
# Save reduced rating data
reduced_path = os.path.join(data_path, 'ratings_reduced.csv')
ratings_reduced_df.to_csv(reduced_path, index=False)
pd.read_csv(reduced_path).head(2)

|   | userId | movieId | rating | timestamp |
|---|--------|---------|--------|-----------|
| 0 | 1 | 858 | 5.0 | 2015-03-09 22:52:03 |
| 1 | 1 | 1246 | 5.0 | 2015-03-09 22:52:36 |


In [15]:
# Load rating data in chunks
loaded, chunksize = 0, 500000
start = dt.datetime.now() # start time
for chunk in pd.read_csv(reduced_path, chunksize=chunksize):
    to_row = min(loaded + chunksize, ratings_reduced_df.shape[0])
    print('Loading rows', loaded, 'to', to_row, end=' | ') # print progress
    chunk.to_sql('ratings', engine, if_exists='append') # write to db
    loaded += chunksize
    print((dt.datetime.now() - start), 'elapsed') # elapsed time

Loading rows 0 to 500000 | 0:01:15.224472 elapsed
Loading rows 500000 to 1000000 | 0:02:30.261546 elapsed
Loading rows 1000000 to 1500000 | 0:03:44.143791 elapsed
Loading rows 1500000 to 2000000 | 0:04:59.266626 elapsed
Loading rows 2000000 to 2500000 | 0:06:10.765867 elapsed
Loading rows 2500000 to 3000000 | 0:07:21.904725 elapsed
Loading rows 3000000 to 3500000 | 0:08:34.437819 elapsed
Loading rows 3500000 to 4000000 | 0:09:49.430826 elapsed
Loading rows 4000000 to 4265986 | 0:10:28.570608 elapsed
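
The chunked-load pattern can be tried without a PostgreSQL server. The sketch below assumes an in-memory SQLite database as a stand-in for `movie_db` and a small made-up CSV in place of `ratings_reduced.csv`. It also passes `index=False`, a variation on the notebook's call that leaves the DataFrame index out of the table.

```python
import io
import sqlite3

import pandas as pd

# Toy CSV standing in for ratings_reduced.csv (25 made-up rows)
toy_csv = 'userId,movieId,rating\n' + ''.join(
    f'{i},{i % 3},{(i % 10 + 1) / 2}\n' for i in range(25)
)

# In-memory SQLite database standing in for PostgreSQL (assumption for this sketch)
conn = sqlite3.connect(':memory:')

loaded = 0
for chunk in pd.read_csv(io.StringIO(toy_csv), chunksize=10):
    # Append each chunk to the same table; index=False omits the index column
    chunk.to_sql('ratings', conn, if_exists='append', index=False)
    loaded += len(chunk)  # count rows actually written, so the last chunk is exact
    print('Loaded', loaded, 'rows')

# Verify the load with a cheap aggregate instead of reading the whole table back
total = conn.execute('SELECT COUNT(*) FROM ratings').fetchone()[0]
print(total)
conn.close()
```

Only one chunk is held in memory at a time, which is the point of reading the saved file back in pieces rather than writing the full DataFrame in a single call.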


In [16]:
# Test query
pd.read_sql('SELECT * FROM ratings', engine).info(show_counts=True)  # `null_counts` was renamed to `show_counts` in newer pandas

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4265986 entries, 0 to 4265985
Data columns (total 5 columns):
 #   Column     Non-Null Count    Dtype  
---  ------     --------------    -----  
 0   index      4265986 non-null  int64  
 1   userId     4265986 non-null  int64  
 2   movieId    4265986 non-null  int64  
 3   rating     4265986 non-null  float64
 4   timestamp  4265986 non-null  object 
dtypes: float64(1), int64(3), object(1)
memory usage: 162.7+ MB
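
The `object` dtype for `timestamp` in the test query is consistent with the CSV round trip: the datetime column was written out as text and read back as plain strings before being loaded. A minimal sketch of the effect, and of one possible remedy using `parse_dates`, is shown below on two rows taken from the reduced data. Whether the notebook should store real timestamps is a design choice left open here.

```python
import io

import pandas as pd

# Two rows from the reduced rating data, with a proper datetime column
toy = pd.DataFrame({
    'movieId': [858, 1246],
    'timestamp': pd.to_datetime(['2015-03-09 22:52:03', '2015-03-09 22:52:36']),
})

# Writing to CSV turns the datetimes into text
csv_text = toy.to_csv(index=False)

# Reading back without parse_dates leaves the column as strings
plain = pd.read_csv(io.StringIO(csv_text))

# Reading back with parse_dates restores the datetime dtype
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['timestamp'])

print(plain['timestamp'].dtype)
print(parsed['timestamp'].dtype)
```

Passing the same `parse_dates` argument to the chunked `pd.read_csv` call would let `to_sql` create a timestamp column rather than a text one.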
