# **Initializing Postgres Tables**

In this notebook, I'll initialize the Postgres tables. To run it, you'll need a Postgres instance running (for example, from a Postgres image).


# **Setup**

The cells below will set up the rest of the notebook.

I'll start by configuring the kernel:


In [1]:
# Change the working directory to the parent folder
%cd ..

# Enable the autoreload extension, which will automatically load in new code as it's written
%load_ext autoreload
%autoreload 2

# Set the environment variables that control logging and progress bars
%env LOG_TO_CONSOLE=True
%env LOG_LEVEL=DEBUG
%env TQDM_ENABLED=True

d:\data\programming\neural-needledrop\database
env: LOG_TO_CONSOLE=True
env: LOG_LEVEL=DEBUG
env: TQDM_ENABLED=True


Now I'll import some necessary modules:


In [2]:
# General import statements
import pandas as pd
from pandas_gbq import read_gbq
from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String, Date, Float, Boolean, DateTime
from sqlalchemy.orm import sessionmaker, declarative_base
from sqlalchemy.sql import func, text

# Importing modules custom-built for this project
from utils.settings import (
    GBQ_PROJECT_ID,
    GBQ_DATASET_ID,
    POSTGRES_USER,
    POSTGRES_PASSWORD,
    POSTGRES_HOST,
    POSTGRES_PORT,
    POSTGRES_DB,
    LOG_TO_CONSOLE
)
from utils.logging import get_logger
from utils.postgres import delete_table, create_table

# Set up a logger for this notebook
logger = get_logger("postgres_notebook", log_to_console=LOG_TO_CONSOLE)

Next, I'll build the connection string and create the SQLAlchemy engine and session:

In [4]:
# Create the connection string to the database
postgres_connection_string = f"postgresql://{POSTGRES_USER}:{POSTGRES_PASSWORD}@{POSTGRES_HOST}:{POSTGRES_PORT}/{POSTGRES_DB}"

# Create the connection engine
engine = create_engine(postgres_connection_string)
metadata = MetaData()
session = sessionmaker(bind=engine)()
Base = declarative_base()
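
One caveat with the f-string above: if `POSTGRES_USER` or `POSTGRES_PASSWORD` contains characters such as `@`, `/`, or `:`, the resulting URL may not parse as intended. Below is a minimal sketch of one way around that, using the standard library's `quote_plus`. The helper name `build_postgres_url` and the example credentials are hypothetical and not part of this project.

```python
from urllib.parse import quote_plus


def build_postgres_url(user, password, host, port, database):
    """Build a postgresql:// URL, percent-encoding the user and password.

    Hypothetical helper for illustration; not part of utils.settings.
    """
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Made-up credentials, chosen to include characters that need escaping
example_url = build_postgres_url(
    "needledrop", "p@ss/word", "localhost", 5432, "needledrop_db"
)
print(example_url)
```

The returned string can be passed to `create_engine` in the same way as `postgres_connection_string`.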

# **Deleting Tables**
First, I'll delete the tables if they already exist, so that the notebook starts from a clean slate. Note that this discards any data they hold.

In [5]:
# Indicate which tables we want to delete
tables_to_delete = ["video_metadata", "embeddings", "transcriptions"]

# Iterate through each of the tables and delete them
for table_name in tables_to_delete:
    delete_table(table_name, engine, logger)

2024-01-25 23:17:18,682 - postgres_notebook - DEBUG - Successfully deleted table 'video_metadata'
2024-01-25 23:17:18,698 - postgres_notebook - DEBUG - Successfully deleted table 'embeddings'
2024-01-25 23:17:18,713 - postgres_notebook - DEBUG - Successfully deleted table 'transcriptions'
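
The source of `delete_table` isn't shown in this notebook. As a rough sketch of what a helper like it could do, assuming SQLAlchemy 1.4 or newer, the snippet below issues `DROP TABLE IF EXISTS` inside a transaction. The name `drop_table_if_exists` is hypothetical, and the demo runs against in-memory SQLite rather than Postgres so that it works without a server.

```python
import logging

from sqlalchemy import create_engine, inspect, text


def drop_table_if_exists(table_name, engine, logger):
    """Hypothetical stand-in for utils.postgres.delete_table (an assumption)."""
    # table_name is interpolated as an identifier, so it must come from trusted code
    with engine.begin() as connection:
        connection.execute(text(f'DROP TABLE IF EXISTS "{table_name}"'))
    logger.debug("Dropped table '%s' (if it existed)", table_name)


# Demo against in-memory SQLite, standing in for the Postgres engine
demo_engine = create_engine("sqlite:///:memory:")
demo_logger = logging.getLogger("drop_demo")

with demo_engine.begin() as connection:
    connection.execute(text("CREATE TABLE video_metadata (id TEXT PRIMARY KEY)"))

drop_table_if_exists("video_metadata", demo_engine, demo_logger)
# Calling it again is harmless because of IF EXISTS
drop_table_if_exists("video_metadata", demo_engine, demo_logger)

remaining_tables = inspect(demo_engine).get_table_names()
print(remaining_tables)
```

Because of `IF EXISTS`, the cell can be re-run safely even when some of the tables are already gone.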


# **Initializing Tables**
Next, I'll define a schema for each table and then create it.

### `video_metadata`
The first table will be the `video_metadata` table, which will contain information about all of the videos in our database:

In [6]:
# Define the schema that we'll be using for this table
video_metadata_table_schema = [
    Column("id", String, primary_key=True),
    Column("title", String),
    Column("length", Integer),
    Column("channel_id", String),
    Column("channel_name", String),
    Column("short_description", String),
    Column("description", String),
    Column("view_ct", Integer),
    Column("url", String),
    Column("small_thumbnail_url", String),
    Column("large_thumbnail_url", String),
    Column("video_type", String),
    Column("review_score", Integer),
    Column("publish_date", DateTime),
    Column("scrape_date", DateTime),
]

# Create the table
create_table("video_metadata", video_metadata_table_schema, engine, metadata, logger)

2024-01-25 23:17:18,822 - postgres_notebook - DEBUG - Successfully created table 'video_metadata'
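
The `create_table` helper lives in `utils.postgres`, and its body isn't shown here. A minimal sketch of how a helper with that signature could work, assuming SQLAlchemy 1.4 or newer: it attaches the `Column` objects to a `Table` registered on the shared `MetaData` and then emits `CREATE TABLE`. The name `create_table_from_columns`, the demo table, and the SQLite engine are all illustrative assumptions.

```python
import logging

from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine, inspect


def create_table_from_columns(table_name, columns, engine, metadata, logger):
    """Hypothetical stand-in for utils.postgres.create_table (an assumption)."""
    # Attach the Column objects to a Table registered on the shared MetaData
    table = Table(table_name, metadata, *columns)
    # Emit CREATE TABLE, skipping it if the table already exists
    table.create(engine, checkfirst=True)
    logger.debug("Created table '%s'", table_name)
    return table


# Demo against in-memory SQLite so the sketch runs without a Postgres server
demo_engine = create_engine("sqlite:///:memory:")
demo_metadata = MetaData()
demo_logger = logging.getLogger("create_demo")

demo_schema = [
    Column("id", String, primary_key=True),
    Column("title", String),
    Column("view_ct", Integer),
]
create_table_from_columns(
    "video_metadata_demo", demo_schema, demo_engine, demo_metadata, demo_logger
)

# Read the column names back from the database to confirm the table exists
created_columns = [
    column["name"]
    for column in inspect(demo_engine).get_columns("video_metadata_demo")
]
print(created_columns)
```

One detail to keep in mind with this pattern: a `Column` object can only belong to one `Table`, so each schema list should be built fresh rather than reused across tables.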


### `transcriptions`
Next is the `transcriptions` table, which will hold the transcript segments for each video, identified by the video's `url`.

In [7]:
# Define the schema that we'll be using for this table
transcriptions_table_schema = [
    Column("url", String),
    Column("text", String),
    Column("segment_id", Integer),
    Column("segment_seek", Integer),
    Column("segment_start", Integer),
    Column("segment_end", Integer),
]

# Create the table
create_table("transcriptions", transcriptions_table_schema, engine, metadata, logger)

2024-01-25 23:17:18,892 - postgres_notebook - DEBUG - Successfully created table 'transcriptions'
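
To illustrate how rows in this layout might be written and read back, here is a small sketch that inserts two made-up segments and fetches a video's text in `segment_id` order. It assumes SQLAlchemy 1.4 or newer and uses in-memory SQLite in place of Postgres. The URL and segment values are invented for the example.

```python
from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine, insert, select

# In-memory SQLite stands in for Postgres so the sketch is self-contained
demo_engine = create_engine("sqlite:///:memory:")
demo_metadata = MetaData()

# Same columns as the transcriptions schema above
transcriptions_demo = Table(
    "transcriptions",
    demo_metadata,
    Column("url", String),
    Column("text", String),
    Column("segment_id", Integer),
    Column("segment_seek", Integer),
    Column("segment_start", Integer),
    Column("segment_end", Integer),
)
demo_metadata.create_all(demo_engine)

demo_url = "https://example.com/watch?v=demo"  # invented URL
demo_rows = [
    {"url": demo_url, "text": "second segment", "segment_id": 1,
     "segment_seek": 0, "segment_start": 4, "segment_end": 9},
    {"url": demo_url, "text": "first segment", "segment_id": 0,
     "segment_seek": 0, "segment_start": 0, "segment_end": 4},
]

with demo_engine.begin() as connection:
    # Passing a list of dicts performs a bulk insert
    connection.execute(insert(transcriptions_demo), demo_rows)

    # Fetch one video's segments in playback order
    query = (
        select(transcriptions_demo.c.text)
        .where(transcriptions_demo.c.url == demo_url)
        .order_by(transcriptions_demo.c.segment_id)
    )
    ordered_text = [row[0] for row in connection.execute(query)]

print(ordered_text)
```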


### `embeddings`
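Last is the `embeddings` table. It needs a *little* more setup than the other tables, since it stores vectors using the `pgvector` extension. `CREATE EXTENSION` only enables the extension for this database, so `pgvector` must already be installed on the Postgres server.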

In [8]:
# Enable the pgvector Extension
session.execute(text("CREATE EXTENSION IF NOT EXISTS vector"))
session.commit()

from pgvector.sqlalchemy import Vector

# Define the schema that we'll be using for this table
embeddings_table_schema = [
    Column("id", String, primary_key=True),
    Column("url", String),
    Column("embedding_type", String),
    Column("start_segment", Integer),
    Column("end_segment", Integer),
    Column("segment_length", Integer),
    Column("embedding", Vector(1536)),
]

# Create the table
create_table("embeddings", embeddings_table_schema, engine, metadata, logger)

2024-01-25 23:17:18,985 - postgres_notebook - DEBUG - Successfully created table 'embeddings'
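
Because the column is declared as `Vector(1536)`, pgvector will reject any embedding whose length is not 1536. A small sketch of a check that could run before inserting rows is shown below; the constant and function names are hypothetical and not part of this project.

```python
# Must match the dimension passed to Vector(...) in the schema above
EMBEDDING_DIMENSIONS = 1536


def validate_embedding(embedding, expected_dimensions=EMBEDDING_DIMENSIONS):
    """Return the embedding as a list of floats, or raise if its length is wrong."""
    values = [float(value) for value in embedding]
    if len(values) != expected_dimensions:
        raise ValueError(
            f"Expected {expected_dimensions} dimensions, got {len(values)}"
        )
    return values


# A correctly sized embedding passes through unchanged
good_embedding = validate_embedding([0.0] * 1536)

# A wrongly sized embedding is rejected before it ever reaches the database
try:
    validate_embedding([0.0] * 768)
    short_embedding_rejected = False
except ValueError:
    short_embedding_rejected = True

print(len(good_embedding), short_embedding_rejected)
```

Catching a mismatch in Python gives a clearer error than a failed insert part-way through a batch.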


Now that all of the tables exist, I'll close the session:

In [9]:
session.close()
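
An alternative to calling `close()` by hand is to open the session in a `with` block, which closes it automatically even if a statement raises. The sketch below assumes SQLAlchemy 1.4 or newer and runs against in-memory SQLite; `demo_engine` and `DemoSession` are illustrative names.

```python
from sqlalchemy import create_engine, text
from sqlalchemy.orm import sessionmaker

# In-memory SQLite stands in for the Postgres engine
demo_engine = create_engine("sqlite:///:memory:")
DemoSession = sessionmaker(bind=demo_engine)

# The session is closed when the block exits, whether or not an error occurred
with DemoSession() as demo_session:
    result = demo_session.execute(text("SELECT 1")).scalar_one()
    demo_session.commit()

print(result)
```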