# **Prototyping Data Pipeline**
In this notebook, I'll write a set of functions that prototype a data pipeline for this app. Once they work, I'll move them out of this notebook and into a utility file that the main pipeline script can also import.

# Setup
The cells below will set up the rest of the notebook.

I'll start by configuring the kernel: 

In [1]:
# Change the working directory 
%cd ..

# Enable the autoreload extension, which will automatically load in new code as it's written
%load_ext autoreload
%autoreload 2

d:\data\programming\neural-needledrop\pipeline


Now I'll import some necessary modules:

In [2]:
# General import statements
import pandas as pd
from pytubefix import YouTube, Channel
from google.cloud import bigquery
import traceback
import time
import random
from tqdm import tqdm
import pandas_gbq
import datetime
import uuid
from datetime import timedelta
from pathlib import Path
from google.cloud import storage
from google.cloud.exceptions import NotFound
import whisper

# Importing custom utility functions
import utils.gbq as gbq_utils
import utils.youtube as youtube_utils
import utils.gcs as gcs_utils

# Indicate whether or not we want tqdm progress bars
tqdm_enabled = True

# Set some constants for the project
GBQ_PROJECT_ID = "neural-needledrop"
GBQ_DATASET_ID = "backend_data"

# Set the pandas_gbq context to the project ID
# pandas_gbq.context.project = GBQ_PROJECT_ID

We'll also load in a whisper model:

In [3]:
# Load in the Whisper model of choice
whisper_model_size = "tiny"
whisper_model = whisper.load_model(whisper_model_size)

# Checking GBQ Table
The **very** first thing I need to do: check the actual `video_metadata` GBQ table to determine the most recent video I've downloaded. 

In [4]:
# Define the query that'll grab the most recent video url
most_recent_video_url_query = """
SELECT
  metadata.url
FROM
  `neural-needledrop.backend_data.video_metadata` metadata
ORDER BY
  publish_date DESC, scrape_date DESC
LIMIT 1
"""

# Use pandas-gbq to run the query
most_recent_video_url_df = pd.read_gbq(most_recent_video_url_query, project_id=GBQ_PROJECT_ID)

# If the length of the dataframe is zero, then we need to set the url to None
if len(most_recent_video_url_df) == 0:
    most_recent_video_url = None

# Otherwise, we can just grab the url from the dataframe
else:
    most_recent_video_url = most_recent_video_url_df.iloc[0]["url"]

# Identifying New Videos
The first portion of the pipeline: determining if there are any videos to work with in the first place! 

I'll start by parameterizing the method: 

In [5]:
# Parameterize the identification method
video_limit = 1000 # If video_limit is None, then we're going to download information for all of the videos

# Define the channel of interest
channel_url = "https://www.youtube.com/c/theneedledrop"

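# Override the value fetched from GBQ above, so this prototype run doesn't stop early at a known video
most_recent_video_url = None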

# Indicate the step size for parsing the videos
video_parse_step_size = 350

Now that I've got the method scoped out, I'm going to write it. I'll identify the first couple of videos. 

In [6]:
def get_video_urls_from_channel(channel, most_recent_video_url=None, video_limit=None, video_parse_step_size=10):
    """
    Helper method to identify video URLs from a channel, `video_parse_step_size` URLs at a time.
    Parsing stops as soon as `most_recent_video_url` shows up in the collected URLs, the channel
    runs out of videos, or at least `video_limit` URLs have been collected.
    If `most_recent_video_url` is None, we collect everything up to the `video_limit`; if *that* is
    also None, we collect every video on the channel. The limit is only checked after each full step,
    so the result can overshoot `video_limit` by up to `video_parse_step_size - 1` URLs.
    """
    
    # Initialize the video URLs
    video_urls = []
    
    # Initialize the video count
    video_count = 0
    
    # Iterate through the channel's videos until we find the `most_recent_video_url`
    while most_recent_video_url not in video_urls:
        
        # Fetch the video URLs
        new_video_urls = channel.video_urls[video_count:video_count+video_parse_step_size]
        
        # Break out if no new video URLs were found
        if len(new_video_urls) == 0:
            break
        
        video_urls.extend(new_video_urls)
        
        # Update the video count
        video_count += video_parse_step_size
        
        # If we've reached the video limit, then break
        if (video_limit is not None and video_count >= video_limit):
            break
    
    # Return the video URLs
    return video_urls

This method should do what I want. Let's test it:

In [7]:
# This is the list of video URLs we're going to parse
video_urls_to_parse = get_video_urls_from_channel(
    channel=Channel(channel_url),
    most_recent_video_url=None,
    video_limit=video_limit,
    video_parse_step_size=video_parse_step_size,
)

# If the most_recent_video_url is not None, then we'll keep only the videos that appear
# before it in the channel listing (dropping it and everything listed after it)
try:
    if most_recent_video_url is not None:
        video_urls_to_parse = video_urls_to_parse[
            : video_urls_to_parse.index(most_recent_video_url)
        ]
# If the URL isn't in the list (or anything else goes wrong), print the traceback and keep the full list
except Exception:
    traceback.print_exc()

# Print some information about the video URLs we're going to parse
print(f"Identified {len(video_urls_to_parse)} videos to parse.")

Identified 1050 videos to parse.
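
The count above (1050) is higher than `video_limit` (1000) because the limit is only checked after each full step of 350 URLs. Here's a minimal sketch of that paging behaviour. The `FakeChannel` class and its URLs are made up for illustration, and `collect_urls` is just the same loop reproduced so the snippet runs on its own:

```python
class FakeChannel:
    """Hypothetical stand-in for a channel object; it only exposes `video_urls`."""

    def __init__(self, n_videos):
        self.video_urls = [f"https://example.com/watch?v={i}" for i in range(n_videos)]


def collect_urls(channel, stop_url=None, video_limit=None, step=10):
    """Same paging loop as get_video_urls_from_channel, copied here to be standalone."""
    urls, count = [], 0
    while stop_url not in urls:
        page = channel.video_urls[count : count + step]
        if not page:
            break
        urls.extend(page)
        count += step
        if video_limit is not None and count >= video_limit:
            break
    return urls


channel = FakeChannel(n_videos=5000)
urls = collect_urls(channel, video_limit=1000, step=350)
print(len(urls))  # 1050: three full steps of 350, the first total >= 1000

# Slicing afterwards would enforce the limit exactly, if that's what's wanted
print(len(urls[:1000]))  # 1000
```

For this notebook the overshoot is harmless, since already-parsed videos get filtered out in the next step anyway.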


### Removing Already Parsed Videos
We're going to upload a temporary table with all of the video URLs to GBQ. 

In [8]:
# Create a DataFrame from the video URLs
video_urls_to_parse_df = pd.DataFrame(video_urls_to_parse, columns=["url"])

# Create a temporary table in GBQ
temporary_table_name = gbq_utils.create_temporary_table_in_gbq(
    dataframe=video_urls_to_parse_df,
    project_id=GBQ_PROJECT_ID,
    dataset_name=GBQ_DATASET_ID,
    table_name="temporary_video_urls_to_parse",
    if_exists="replace",
)

100%|██████████| 1/1 [00:00<?, ?it/s]


With this temporary table in hand, we'll query GBQ to figure out the *actual* videos to download. 

In [9]:
# Create the query to identify the videos that we need to parse
actual_videos_to_parse_query = f"""
SELECT
  temp_urls.url
FROM
  `{temporary_table_name}` temp_urls
LEFT JOIN
  `backend_data.video_metadata` metadata
ON
  metadata.url = temp_urls.url
WHERE
  metadata.id IS NULL
"""

# Execute the query
actual_videos_to_parse_df = pd.read_gbq(
    actual_videos_to_parse_query, project_id=GBQ_PROJECT_ID
)
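
The `LEFT JOIN ... WHERE metadata.id IS NULL` pattern is an anti-join: it keeps only the rows from the left table that have no match on the right. As a rough illustration of the same logic outside of GBQ, here's a sketch in plain Python. The `anti_join` helper and the URLs are made up, and nothing here touches BigQuery:

```python
def anti_join(candidate_urls, already_stored_urls):
    """Return the candidate URLs that are not already stored, preserving their order."""
    stored = set(already_stored_urls)
    return [url for url in candidate_urls if url not in stored]


# Made-up example data
candidates = [
    "https://www.youtube.com/watch?v=aaa",
    "https://www.youtube.com/watch?v=bbb",
    "https://www.youtube.com/watch?v=ccc",
]
stored = ["https://www.youtube.com/watch?v=bbb"]

print(anti_join(candidates, stored))
# ['https://www.youtube.com/watch?v=aaa', 'https://www.youtube.com/watch?v=ccc']
```

Doing this in GBQ instead avoids pulling the whole `video_metadata` table down just to compare URLs.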

Finally, some cleanup: overwriting `video_urls_to_parse` with the contents of `actual_videos_to_parse_df`, and deleting the temporary table.

In [10]:
# Print some information about the videos we're going to parse
print(f"After filtering out videos that have already been parsed, we have {len(actual_videos_to_parse_df)} videos to parse.")

# Overriding the video_urls_to_parse with the contents of the actual_videos_to_parse_df
video_urls_to_parse = list(actual_videos_to_parse_df["url"])

# Use the gbq_utils to delete the temporary table
temp_table_project_id, temp_table_dataset_id, temp_table_name = temporary_table_name.split(".")
gbq_utils.delete_table(
    project_id=temp_table_project_id,
    dataset_id=temp_table_dataset_id,
    table_id=temp_table_name,
)

After filtering out videos that have already been parsed, we have 650 videos to parse.
Table backend_data:temporary_video_urls_to_parse deleted.


# Downloading Video Metadata
Below, I'm going to define a method that'll download a video's metadata. 

In [11]:
def parse_metadata_from_video(video_url):
    """
    This method will parse a dictionary containing metadata from a video, given its URL.
    """

    # Create a video object
    video = YouTube(video_url)

    # Keep a dictionary to keep track of the metadata we're interested in
    video_metadata_dict = {}

    # We'll wrap this in a try/except block so that we can catch any errors that occur
    try:
        # Parse the `videoDetails` from the video; this contains a lot of the metadata we're interested in
        vid_info_dict = video.vid_info
        video_info_dict = vid_info_dict.get("videoDetails")

    # If we run into an Exception this early on, we'll re-raise it with more context
    except Exception as e:
        raise Exception(
            f"Error parsing video metadata for video {video_url}: '{e}'\nTraceback is as follows:\n{traceback.format_exc()}"
        ) from e

    # Extract different pieces of the video metadata
    video_metadata_dict["id"] = video_info_dict.get("videoId")
    video_metadata_dict["title"] = video_info_dict.get("title")
    video_metadata_dict["length"] = video_info_dict.get("lengthSeconds")
    video_metadata_dict["channel_id"] = video_info_dict.get("channelId")
    video_metadata_dict["channel_name"] = video_info_dict.get("author")
    video_metadata_dict["short_description"] = video_info_dict.get("shortDescription")
    video_metadata_dict["view_ct"] = video_info_dict.get("viewCount")
    video_metadata_dict["url"] = video_info_dict.get("video_url")
    video_metadata_dict["small_thumbnail_url"] = (
        video_info_dict.get("thumbnail").get("thumbnails")[0].get("url")
    )
    video_metadata_dict["large_thumbnail_url"] = (
        video_info_dict.get("thumbnail").get("thumbnails")[-1].get("url")
    )

    # Try and extract the publish_date
    try:
        publish_date = video.publish_date
        video_metadata_dict["publish_date"] = publish_date
    except Exception:
        video_metadata_dict["publish_date"] = None

    # Try and extract the full description
    try:
        full_description = video.description
        video_metadata_dict["description"] = full_description
    except Exception:
        video_metadata_dict["description"] = None
    
    # Use datetime to get the scrape_date (the current datetime)
    video_metadata_dict["scrape_date"] = datetime.datetime.now()
    
    # Add the url to the video_metadata_dict
    video_metadata_dict["url"] = video_url

    # Finally, return the video metadata dictionary
    return video_metadata_dict

Now: we'll need to iterate through each of the videos and download their metadata. 

In [12]:
# Parameterize the video metadata parsing
time_to_sleep_between_parsing = 5
sleep_randomization_factor = 3.5

# We'll iterate through each of the videos in the list and parse their metadata
video_metadata_dicts_by_video_url = {}
for video_url in tqdm(video_urls_to_parse, disable=not tqdm_enabled):
    
    # We'll wrap this in a try/except block so that we can catch any errors that occur
    try:
        # Parse the metadata from the video
        video_metadata_dict = parse_metadata_from_video(video_url)
        
        # Add the video metadata dictionary to the dictionary of video metadata dictionaries
        video_metadata_dicts_by_video_url[video_url] = video_metadata_dict
        
        # Sleep for a random amount of time
        time_to_sleep = random.uniform(time_to_sleep_between_parsing, time_to_sleep_between_parsing + (sleep_randomization_factor * time_to_sleep_between_parsing))
        time.sleep(time_to_sleep)
    
    # If we run into an Exception, then we'll print out the traceback
    except Exception as e:
        traceback.print_exc()

100%|██████████| 650/650 [2:39:19<00:00, 14.71s/it]  
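
To make the sleep range concrete: with a base of 5 seconds and a randomization factor of 3.5, each delay is drawn uniformly from 5 to 22.5 seconds. Here's a small sketch of that calculation. The `jittered_sleep_seconds` helper is a name I'm making up here, not something from the utility modules:

```python
import random


def jittered_sleep_seconds(base_seconds, randomization_factor, rng=random):
    """Draw a delay uniformly from [base, base * (1 + factor)], as in the loop above."""
    return rng.uniform(
        base_seconds, base_seconds + randomization_factor * base_seconds
    )


low, high = 5, 5 + 3.5 * 5        # 5 to 22.5 seconds
expected_mean = (low + high) / 2  # average delay from sleeping alone

samples = [jittered_sleep_seconds(5, 3.5) for _ in range(1000)]
print(min(samples) >= low, max(samples) <= high)  # True True
print(expected_mean)  # 13.75
```

An average sleep of 13.75 seconds is roughly in line with the ~14.7 s/it reported by the progress bar, with the remainder presumably spent on the metadata requests themselves.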


## Storing the Metadata
Now that I've downloaded some metadata about different videos, I need to store it. 

In [13]:
# Create a list of the rows to add to the table
rows_to_add = list(video_metadata_dicts_by_video_url.values())

# Add the rows to the table
gbq_utils.add_rows_to_table(
    project_id=GBQ_PROJECT_ID,
    dataset_id=GBQ_DATASET_ID,
    table_id="video_metadata",
    rows=rows_to_add   
)

Loaded 650 row(s) into backend_data:video_metadata.


# Downloading Video Audio
Next up, I need to download some video audio. This one probably needs to go a lot slower than the metadata fetching 😅

### Determining Audio to Download
I need to check with GBQ to see if there are any videos that I need to download. 

In [14]:
# Parameterize the query
n_max_video_urls = 50

# The query below will determine which videos we need to download audio for
videos_for_audio_parsing_query = f"""
SELECT
  video.url
FROM
  `backend_data.video_metadata` video
LEFT JOIN
  `backend_data.audio` audio
ON
  audio.video_url = video.url
WHERE
  audio.audio_gcs_uri IS NULL
LIMIT {n_max_video_urls}
"""

# Execute the query
videos_for_audio_parsing_df = pd.read_gbq(
    videos_for_audio_parsing_query, project_id=GBQ_PROJECT_ID
)

### Downloading Audio
Next, I'm going to use `pytube` to download the audio of these videos.

In [15]:
# Parameterize the download
time_to_sleep_between_downloads = 25
sleep_randomization_factor = 3.5
download_directory = Path("temp_data/")

# Iterate through the videos and download their audio
for video_url in tqdm(videos_for_audio_parsing_df["url"], disable=not tqdm_enabled):
    # We'll wrap this in a try/except block so that we can catch any errors that occur
    try:
        # Download the audio from the video
        youtube_utils.download_audio_from_video(
            video_url=video_url, data_folder_path=download_directory
        )

    # If we run into an Exception, then we'll print out the traceback
    except Exception as e:
        traceback.print_exc()

100%|██████████| 50/50 [01:48<00:00,  2.17s/it]


### Uploading Audio to GCS
Now that I've downloaded the audio, I need to upload it to GCS. 

I'll start by creating the bucket if it doesn't exist:

In [35]:
# Make sure that the neural-needledrop-audio bucket exists
gcs_utils.create_bucket(
    "neural-needledrop-audio", project_id=GBQ_PROJECT_ID, delete_if_exists=False
)

Bucket neural-needledrop.neural-needledrop-audio deleted
Bucket neural-needledrop.neural-needledrop-audio created


Next: upload all of the audio. 

In [17]:
# Parameterize the audio upload process
delete_files_after_upload = True

# Iterate through all of the video urls in the videos_for_audio_parsing_df
for row in tqdm(
    list(videos_for_audio_parsing_df.itertuples()), disable=not tqdm_enabled
):
    # We'll wrap this in a try/except block so that we can catch any errors that occur
    try:
        # Get the video url
        video_url = row.url

        # Get the video id
        video_id = video_url.split("watch?v=")[-1]

        # Get the path to the audio file
        audio_file_path = download_directory / f"{video_id}.m4a"

        # Check to see if this file exists
        if not Path(audio_file_path).exists():
            # If it doesn't exist, then we'll continue. Print out a warning
            print(f"Warning: {audio_file_path} does not exist. Skipping...")
            continue

        # Name of the GCS bucket that stores the audio
        gcs_uri = "neural-needledrop-audio"

        # Get the audio file path as a string
        audio_file_path_str = str(audio_file_path)

        # Convert the audio file to .mp3 using youtube_utils
        youtube_utils.convert_m4a_to_mp3(
            input_file_path=audio_file_path_str,
            output_file_path=audio_file_path_str.replace(".m4a", ".mp3"),
        )

        # Remove the .m4a file
        audio_file_path.unlink()

        # Update the audio_file_path_str
        audio_file_path = Path(audio_file_path_str.replace(".m4a", ".mp3"))
        audio_file_path_str = str(audio_file_path)

        # Upload the .mp3 file to GCS
        gcs_utils.upload_file_to_bucket(
            file_path=audio_file_path_str,
            bucket_name=gcs_uri,
            project_id=GBQ_PROJECT_ID,
        )

        # Create a dictionary to store the audio metadata
        audio_metadata_dict = {
            "video_url": video_url,
            "audio_gcs_uri": f"gs://{gcs_uri}/{audio_file_path.name}",
            "scrape_date": datetime.datetime.now(),
        }

        # Add the audio metadata to the table
        try:
            gbq_utils.add_rows_to_table(
                project_id=GBQ_PROJECT_ID,
                dataset_id=GBQ_DATASET_ID,
                table_id="audio",
                rows=[audio_metadata_dict],
            )
        except NotFound:
            gbq_utils.generate_audio_table(
                project_id=GBQ_PROJECT_ID,
                dataset_id=GBQ_DATASET_ID,
                delete_if_exists=False,
            )
            gbq_utils.add_rows_to_table(
                project_id=GBQ_PROJECT_ID,
                dataset_id=GBQ_DATASET_ID,
                table_id="audio",
                rows=[audio_metadata_dict],
            )

        # Delete the audio file if delete_files_after_upload
        if delete_files_after_upload:
            audio_file_path.unlink()

    # If we run into an Exception, then we'll print out the traceback
    except Exception as e:
        traceback.print_exc()

# If we're deleting the files after upload, then we'll delete the download_directory
if delete_files_after_upload:
    Path(download_directory).rmdir()

File temp_data\Gwtat22jkig.mp3 uploaded to neural-needledrop.neural-needledrop-audio


 82%|████████▏ | 41/50 [09:58<02:03, 13.70s/it]

Loaded 1 row(s) into backend_data:audio.
Found a bucket named neural-needledrop-audio in project neural-needledrop.
Uploading file JmhWyeGHz-k.mp3 to neural-needledrop.neural-needledrop-audio...
File temp_data\JmhWyeGHz-k.mp3 uploaded to neural-needledrop.neural-needledrop-audio


 84%|████████▍ | 42/50 [10:15<01:59, 14.97s/it]

Loaded 1 row(s) into backend_data:audio.
Found a bucket named neural-needledrop-audio in project neural-needledrop.
Uploading file m9NrCOoOpUw.mp3 to neural-needledrop.neural-needledrop-audio...
File temp_data\m9NrCOoOpUw.mp3 uploaded to neural-needledrop.neural-needledrop-audio


 86%|████████▌ | 43/50 [10:27<01:37, 13.99s/it]

Loaded 1 row(s) into backend_data:audio.
Found a bucket named neural-needledrop-audio in project neural-needledrop.
Uploading file IkWLUgl8W9E.mp3 to neural-needledrop.neural-needledrop-audio...
File temp_data\IkWLUgl8W9E.mp3 uploaded to neural-needledrop.neural-needledrop-audio


 88%|████████▊ | 44/50 [10:41<01:23, 13.88s/it]

Loaded 1 row(s) into backend_data:audio.
Found a bucket named neural-needledrop-audio in project neural-needledrop.
Uploading file S28rYHXVbYg.mp3 to neural-needledrop.neural-needledrop-audio...
File temp_data\S28rYHXVbYg.mp3 uploaded to neural-needledrop.neural-needledrop-audio


 90%|█████████ | 45/50 [11:04<01:22, 16.55s/it]

Loaded 1 row(s) into backend_data:audio.
Found a bucket named neural-needledrop-audio in project neural-needledrop.
Uploading file EOphU_IUBNU.mp3 to neural-needledrop.neural-needledrop-audio...
File temp_data\EOphU_IUBNU.mp3 uploaded to neural-needledrop.neural-needledrop-audio


 92%|█████████▏| 46/50 [11:16<01:01, 15.42s/it]

Loaded 1 row(s) into backend_data:audio.
Found a bucket named neural-needledrop-audio in project neural-needledrop.
Uploading file fx7de5AojVk.mp3 to neural-needledrop.neural-needledrop-audio...
File temp_data\fx7de5AojVk.mp3 uploaded to neural-needledrop.neural-needledrop-audio


 94%|█████████▍| 47/50 [11:29<00:43, 14.45s/it]

Loaded 1 row(s) into backend_data:audio.
Found a bucket named neural-needledrop-audio in project neural-needledrop.
Uploading file YVHThYHhmcc.mp3 to neural-needledrop.neural-needledrop-audio...
File temp_data\YVHThYHhmcc.mp3 uploaded to neural-needledrop.neural-needledrop-audio


 96%|█████████▌| 48/50 [11:44<00:29, 14.87s/it]

Loaded 1 row(s) into backend_data:audio.
Found a bucket named neural-needledrop-audio in project neural-needledrop.
Uploading file VgZkcUA3xko.mp3 to neural-needledrop.neural-needledrop-audio...
File temp_data\VgZkcUA3xko.mp3 uploaded to neural-needledrop.neural-needledrop-audio


 98%|█████████▊| 49/50 [11:56<00:13, 13.90s/it]

Loaded 1 row(s) into backend_data:audio.
Found a bucket named neural-needledrop-audio in project neural-needledrop.
Uploading file eNdoI_lKA78.mp3 to neural-needledrop.neural-needledrop-audio...
File temp_data\eNdoI_lKA78.mp3 uploaded to neural-needledrop.neural-needledrop-audio


100%|██████████| 50/50 [12:09<00:00, 14.60s/it]

Loaded 1 row(s) into backend_data:audio.
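
One thing to note about the upload cell: `video_url.split("watch?v=")[-1]` works for the plain watch URLs stored in `video_metadata`, but it would keep any extra query parameters. Here's a sketch of a more defensive alternative using only the standard library. `extract_video_id` is a hypothetical helper, not part of `youtube_utils`:

```python
from urllib.parse import parse_qs, urlparse


def extract_video_id(video_url):
    """Return the `v` query parameter of a YouTube watch URL, or None if it's absent."""
    query = parse_qs(urlparse(video_url).query)
    values = query.get("v")
    return values[0] if values else None


plain_url = "https://www.youtube.com/watch?v=Gwtat22jkig"
url_with_extras = "https://www.youtube.com/watch?v=Gwtat22jkig&t=42s"

print(extract_video_id(plain_url))            # Gwtat22jkig
print(extract_video_id(url_with_extras))      # Gwtat22jkig
print(url_with_extras.split("watch?v=")[-1])  # Gwtat22jkig&t=42s
```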





Now, a super quick fix that I ought to handle: I'm going to deduplicate the `backend_data.audio` table.

In [18]:
# This query will deduplicate the audio table
deduplicate_audio_table_query = f"""
CREATE OR REPLACE TABLE `{GBQ_PROJECT_ID}.{GBQ_DATASET_ID}.audio` AS (
    SELECT
        video_url,
        audio_gcs_uri,
        scrape_date
    FROM (
        SELECT
            *,
            ROW_NUMBER() OVER (PARTITION BY video_url ORDER BY scrape_date DESC) AS row_num
        FROM
            `{GBQ_PROJECT_ID}.{GBQ_DATASET_ID}.audio`
    ) ordered_table
    WHERE
        ordered_table.row_num = 1
)
"""

# Execute the query
pandas_gbq.read_gbq(deduplicate_audio_table_query, project_id=GBQ_PROJECT_ID)

Downloading: 100%|██████████|


Unnamed: 0,video_url,audio_gcs_uri,scrape_date
0,https://www.youtube.com/watch?v=b33ZjsTHVXU,gs://neural-needledrop-audio/b33ZjsTHVXU.m4a,2024-01-06 10:21:18.066085
1,https://www.youtube.com/watch?v=LwbUvvxf2I8,gs://neural-needledrop-audio/LwbUvvxf2I8.m4a,2024-01-06 13:41:32.128619
2,https://www.youtube.com/watch?v=VIsFCth2IHo,gs://neural-needledrop-audio/VIsFCth2IHo.m4a,2024-01-06 13:41:54.614713
3,https://www.youtube.com/watch?v=ImYGdMBWLFw,gs://neural-needledrop-audio/ImYGdMBWLFw.m4a,2024-01-06 13:43:48.034852
4,https://www.youtube.com/watch?v=m99lkzqB-E0,gs://neural-needledrop-audio/m99lkzqB-E0.m4a,2024-01-06 13:45:23.876181
...,...,...,...
62,https://www.youtube.com/watch?v=xaNzs6IL3pk,gs://neural-needledrop-audio/xaNzs6IL3pk.m4a,2024-01-06 10:16:42.185417
63,https://www.youtube.com/watch?v=hAx_VUHss9w,gs://neural-needledrop-audio/hAx_VUHss9w.m4a,2024-01-06 13:44:41.678088
64,https://www.youtube.com/watch?v=GElu5cR8XOU,gs://neural-needledrop-audio/GElu5cR8XOU.m4a,2024-01-06 13:39:45.842374
65,https://www.youtube.com/watch?v=N2XsYgp6Z8w,gs://neural-needledrop-audio/N2XsYgp6Z8w.m4a,2024-01-06 13:42:50.046260
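
For reference, the `ROW_NUMBER() OVER (PARTITION BY video_url ORDER BY scrape_date DESC)` trick keeps exactly one row per `video_url`: the most recently scraped one. Here's the same idea sketched in plain Python on made-up rows (the `keep_latest_per_url` helper is only for illustration):

```python
from datetime import datetime


def keep_latest_per_url(rows):
    """Keep one row per video_url: the one with the most recent scrape_date."""
    latest = {}
    for row in rows:
        current = latest.get(row["video_url"])
        if current is None or row["scrape_date"] > current["scrape_date"]:
            latest[row["video_url"]] = row
    return list(latest.values())


# Made-up rows: the first URL was scraped twice
rows = [
    {"video_url": "https://example.com/a", "audio_gcs_uri": "gs://bucket/a.m4a",
     "scrape_date": datetime(2024, 1, 6, 10, 0)},
    {"video_url": "https://example.com/a", "audio_gcs_uri": "gs://bucket/a.mp3",
     "scrape_date": datetime(2024, 1, 6, 13, 0)},
    {"video_url": "https://example.com/b", "audio_gcs_uri": "gs://bucket/b.mp3",
     "scrape_date": datetime(2024, 1, 6, 11, 0)},
]

deduplicated = keep_latest_per_url(rows)
print([row["audio_gcs_uri"] for row in deduplicated])
# ['gs://bucket/a.mp3', 'gs://bucket/b.mp3']
```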


# Transcribing Audio with Whisper
Now that I've downloaded some audio, I need to figure out what needs to be transcribed. I can do that by checking the `audio` and `transcriptions` tables.

In [36]:
# This query will determine all of the videos we need to transcribe
videos_for_transcription_query = f"""
SELECT
  DISTINCT audio.video_url AS url,
  audio.audio_gcs_uri
FROM
  `backend_data.audio` audio
LEFT JOIN
  `backend_data.transcriptions` transcript
ON
  audio.video_url = transcript.url
WHERE
  transcript.created_at IS NULL
"""

# Execute the query
videos_for_transcription_df = pd.read_gbq(
    videos_for_transcription_query, project_id=GBQ_PROJECT_ID
)

Now, with all of the audio specified, we need to try and download it from `GCS`. 

In [54]:
# Iterate through all of the video urls in the videos_for_transcription_df
for row in tqdm(
    list(videos_for_transcription_df.itertuples()), disable=not tqdm_enabled
):
    # Parse the GCS URI (the column is named `audio_gcs_uri` in the `audio` table)
    split_gcs_uri = row.audio_gcs_uri.split("gs://")[-1]
    bucket_name, file_name = split_gcs_uri.split("/")[0], "/".join(
        split_gcs_uri.split("/")[1:]
    )

    # Download the audio
    gcs_utils.download_file_from_bucket(
        bucket_name=bucket_name,
        file_name=file_name,
        destination_folder="temp_data/",
        project_id=GBQ_PROJECT_ID,
    )

  0%|          | 0/17 [00:00<?, ?it/s]

Found a bucket named neural-needledrop-audio in project neural-needledrop.


  6%|▌         | 1/17 [00:01<00:28,  1.75s/it]

File 8un7EwrKW0Y.mp3 downloaded from neural-needledrop.neural-needledrop-audio to temp_data/
Found a bucket named neural-needledrop-audio in project neural-needledrop.


 12%|█▏        | 2/17 [00:03<00:27,  1.80s/it]

File KAedi0Mtfj4.mp3 downloaded from neural-needledrop.neural-needledrop-audio to temp_data/
Found a bucket named neural-needledrop-audio in project neural-needledrop.


 18%|█▊        | 3/17 [00:05<00:27,  1.96s/it]

File NU8gDMotlP4.mp3 downloaded from neural-needledrop.neural-needledrop-audio to temp_data/
Found a bucket named neural-needledrop-audio in project neural-needledrop.


 24%|██▎       | 4/17 [00:07<00:25,  1.96s/it]

File G3xgXi6VeXU.mp3 downloaded from neural-needledrop.neural-needledrop-audio to temp_data/
Found a bucket named neural-needledrop-audio in project neural-needledrop.


 29%|██▉       | 5/17 [00:09<00:23,  1.97s/it]

File _7gS3LgWNKw.mp3 downloaded from neural-needledrop.neural-needledrop-audio to temp_data/
Found a bucket named neural-needledrop-audio in project neural-needledrop.


 35%|███▌      | 6/17 [00:11<00:20,  1.82s/it]

File UTSdgU3dZco.mp3 downloaded from neural-needledrop.neural-needledrop-audio to temp_data/
Found a bucket named neural-needledrop-audio in project neural-needledrop.


 41%|████      | 7/17 [00:13<00:19,  1.91s/it]

File PlloTwEBBE8.mp3 downloaded from neural-needledrop.neural-needledrop-audio to temp_data/
Found a bucket named neural-needledrop-audio in project neural-needledrop.


 47%|████▋     | 8/17 [00:15<00:16,  1.87s/it]

File b33ZjsTHVXU.mp3 downloaded from neural-needledrop.neural-needledrop-audio to temp_data/
Found a bucket named neural-needledrop-audio in project neural-needledrop.


 53%|█████▎    | 9/17 [00:16<00:15,  1.88s/it]

File 3CDeF0DViao.mp3 downloaded from neural-needledrop.neural-needledrop-audio to temp_data/
Found a bucket named neural-needledrop-audio in project neural-needledrop.


 59%|█████▉    | 10/17 [00:18<00:13,  1.86s/it]

File sGUKtWa7lVg.mp3 downloaded from neural-needledrop.neural-needledrop-audio to temp_data/
Found a bucket named neural-needledrop-audio in project neural-needledrop.


 65%|██████▍   | 11/17 [00:20<00:10,  1.79s/it]

File 2eftEUAmNZQ.mp3 downloaded from neural-needledrop.neural-needledrop-audio to temp_data/
Found a bucket named neural-needledrop-audio in project neural-needledrop.


 71%|███████   | 12/17 [00:22<00:08,  1.75s/it]

File W_Yu9j3AEXQ.mp3 downloaded from neural-needledrop.neural-needledrop-audio to temp_data/
Found a bucket named neural-needledrop-audio in project neural-needledrop.


 76%|███████▋  | 13/17 [00:24<00:07,  1.95s/it]

File DoWlDA-GKIQ.mp3 downloaded from neural-needledrop.neural-needledrop-audio to temp_data/
Found a bucket named neural-needledrop-audio in project neural-needledrop.


 82%|████████▏ | 14/17 [00:26<00:05,  1.91s/it]

File xaNzs6IL3pk.mp3 downloaded from neural-needledrop.neural-needledrop-audio to temp_data/
Found a bucket named neural-needledrop-audio in project neural-needledrop.


 88%|████████▊ | 15/17 [00:28<00:03,  1.85s/it]

File 3iaoQarq9ZA.mp3 downloaded from neural-needledrop.neural-needledrop-audio to temp_data/
Found a bucket named neural-needledrop-audio in project neural-needledrop.


 94%|█████████▍| 16/17 [00:29<00:01,  1.77s/it]

File dCK78Czmym0.mp3 downloaded from neural-needledrop.neural-needledrop-audio to temp_data/
Found a bucket named neural-needledrop-audio in project neural-needledrop.


100%|██████████| 17/17 [00:31<00:00,  1.83s/it]

File _PDzVzOG4nw.mp3 downloaded from neural-needledrop.neural-needledrop-audio to temp_data/
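
The URI parsing in the download loop splits a `gs://bucket/path` string into a bucket name and an object name. Here's a compact sketch of that step as a standalone function. `split_gcs_uri` is a hypothetical helper, not something that exists in `gcs_utils`:

```python
def split_gcs_uri(gcs_uri):
    """Split 'gs://bucket/path/to/file' into ('bucket', 'path/to/file')."""
    prefix = "gs://"
    if not gcs_uri.startswith(prefix):
        raise ValueError(f"Not a GCS URI: {gcs_uri!r}")

    # Everything up to the first "/" is the bucket; the rest is the object name
    bucket_name, _, file_name = gcs_uri[len(prefix):].partition("/")
    return bucket_name, file_name


print(split_gcs_uri("gs://neural-needledrop-audio/8un7EwrKW0Y.mp3"))
# ('neural-needledrop-audio', '8un7EwrKW0Y.mp3')
```

Unlike the `split("gs://")[-1]` approach, this raises on anything that isn't a `gs://` URI instead of quietly treating it as a bucket name.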





In [59]:
# We'll store the audio metadata in a dictionary
audio_metadata_dict_by_video_url = {}

# Iterate through each of the files in the `temp_data` directory and transcribe them
for child_file in tqdm(list(Path("temp_data/").iterdir()), disable=not tqdm_enabled):
    try:
        if child_file.suffix != ".mp3":
            continue

        # Extract some data about the file
        video_url = f"https://www.youtube.com/watch?v={child_file.stem}"
        video_id = child_file.stem

        # Use whisper to transcribe the audio
        whisper_transcription = whisper_model.transcribe(str(child_file), fp16=False)

        # Store the transcription in the audio_metadata_dict_by_video_url
        audio_metadata_dict_by_video_url[video_url] = whisper_transcription
    except Exception as e:
        raise Exception(
            f"Error transcribing audio file {child_file}: '{e}'\nTraceback is as follows:\n{traceback.format_exc()}"
        )

100%|██████████| 17/17 [17:24<00:00, 61.42s/it]


### Uploading Transcription to GBQ
Now that I've transcribed all of these videos, I'm going to upload the transcriptions to GBQ. 

First, I'll transform the DataFrame to add some needed data:

In [60]:
# Create a DataFrame from the audio_metadata_dict_by_video_url
audio_metadata_df = pd.DataFrame.from_dict(
    audio_metadata_dict_by_video_url, orient="index"
)

# Reset the index into a "url" column
audio_metadata_df.reset_index(inplace=True, names=["url"])

# Explode the "segments" column
audio_metadata_df = audio_metadata_df.explode("segments")

# Rename the "segments" column to "segment" in the audio_metadata_df
audio_metadata_df = audio_metadata_df.rename(columns={"segments": "segment"})

# Add a "created_at" column to the audio_metadata_df
audio_metadata_df["created_at"] = datetime.datetime.now()

# Alter the "text" column so that it's extracted from the "segment" column
audio_metadata_df["text"] = audio_metadata_df["segment"].apply(
    lambda x: x.get("text", None)
)

# Add a "segment_type" column to the audio_metadata_df
audio_metadata_df["segment_type"] = "small_segment"

# We're going to extract some columns from the `segment` dictionary
segment_columns_to_keep = ["id", "seek", "start", "end"]
normalized_segments_df = pd.json_normalize(audio_metadata_df["segment"])
normalized_segments_df = normalized_segments_df[segment_columns_to_keep]

# Rename all of the columns so that they have "segment_" prepended to them
normalized_segments_df = normalized_segments_df.rename(
    columns={col: f"segment_{col}" for col in normalized_segments_df.columns}
)

# Make the final_transcription_df
final_transcription_df = pd.concat(
    [
        audio_metadata_df.drop(columns=["segment"]).reset_index(drop=True),
        normalized_segments_df.reset_index(drop=True),
    ],
    axis=1,
).copy()
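
To make the target shape of `final_transcription_df` easier to see, here's the same reshaping written out in plain Python on a made-up Whisper-style result. The URL and text are invented, and real results carry extra per-segment fields (tokens, log-probabilities, and so on) that I'm ignoring here:

```python
from datetime import datetime

# Made-up stand-in for the dictionary of whisper_model.transcribe(...) outputs
fake_results_by_url = {
    "https://www.youtube.com/watch?v=EXAMPLE": {
        "text": " Hi, everyone. This is a review.",
        "language": "en",
        "segments": [
            {"id": 0, "seek": 0, "start": 0.0, "end": 2.5, "text": " Hi, everyone."},
            {"id": 1, "seek": 0, "start": 2.5, "end": 5.0, "text": " This is a review."},
        ],
    }
}


def flatten_segments(results_by_url):
    """Produce one row per segment, with the segment fields prefixed by `segment_`."""
    created_at = datetime.now()
    rows = []
    for url, result in results_by_url.items():
        for segment in result["segments"]:
            rows.append(
                {
                    "url": url,
                    "text": segment.get("text"),
                    "language": result.get("language"),
                    "created_at": created_at,
                    "segment_type": "small_segment",
                    "segment_id": segment["id"],
                    "segment_seek": segment["seek"],
                    "segment_start": segment["start"],
                    "segment_end": segment["end"],
                }
            )
    return rows


segment_rows = flatten_segments(fake_results_by_url)
print(len(segment_rows))  # 2
print(sorted(segment_rows[0]))
```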

This is a little trickier than I thought. I'll also want to add "full video" transcriptions. 

In [61]:
# Iterate through each of the unique URLs in the final_transcription_df, and
# create a new row for each of them. This row will have the "segment_type"
# of "full_transcription", and its "text" will be the concatenation of all of the
# "text" values for that video. The segment_start will be the earliest segment
# start, and the segment_end will be the latest segment end.
new_rows = []
for video_url in final_transcription_df["url"].unique():
    # Subset the final_transcription_df to just the rows for this video_url
    cur_video_url_df = final_transcription_df[
        final_transcription_df["url"] == video_url
    ]

    # Create a full transcription for this video
    full_transcription = "".join(
        cur_video_url_df.sort_values("segment_start")["text"]
    ).strip()

    # Create a new segment_id
    segment_id = -1

    # Calculate the maximum segment_seek
    segment_seek = cur_video_url_df["segment_seek"].max()

    # Calculate the earliest segment_start and the latest segment_end
    segment_start = cur_video_url_df["segment_start"].min()
    segment_end = cur_video_url_df["segment_end"].max()

    # Add a new row to the new_rows list
    new_rows.append(
        {
            "url": video_url,
            "text": full_transcription,
            "language": "en",
            "created_at": datetime.datetime.now(),
            "segment_type": "full_transcription",
            "segment_id": segment_id,
            "segment_seek": segment_seek,
            "segment_start": segment_start,
            "segment_end": segment_end,
        }
    )

# Add these new rows to the table
final_transcription_df = pd.DataFrame.from_records(
    final_transcription_df.to_dict(orient="records") + new_rows
)

# Deduplicate on the url and segment_id
final_transcription_df = final_transcription_df.drop_duplicates(
    subset=["url", "segment_id"]
)
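
Here's what the "full video" row looks like for a tiny made-up example, sketched without pandas. `build_full_transcription_row` is a name I'm inventing for illustration, and I've left out `language` and `created_at` to keep it short:

```python
def build_full_transcription_row(url, segment_rows):
    """Collapse a video's segment rows into a single full-transcription row."""
    ordered = sorted(segment_rows, key=lambda row: row["segment_start"])
    return {
        "url": url,
        "text": "".join(row["text"] for row in ordered).strip(),
        "segment_type": "full_transcription",
        "segment_id": -1,  # sentinel, so it can't collide with real segment ids
        "segment_seek": max(row["segment_seek"] for row in ordered),
        "segment_start": min(row["segment_start"] for row in ordered),
        "segment_end": max(row["segment_end"] for row in ordered),
    }


# Made-up segment rows, deliberately out of order
example_rows = [
    {"text": " This is a review.", "segment_seek": 0, "segment_start": 2.5, "segment_end": 5.0},
    {"text": " Hi, everyone.", "segment_seek": 0, "segment_start": 0.0, "segment_end": 2.5},
]

full_row = build_full_transcription_row("https://www.youtube.com/watch?v=EXAMPLE", example_rows)
print(full_row["text"])  # Hi, everyone. This is a review.
print(full_row["segment_start"], full_row["segment_end"])  # 0.0 5.0
```

Whisper's segment text starts with a leading space, which is why joining on an empty string and stripping the result gives clean sentences.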

Then, I'll make a temporary table. 

In [62]:
# Define the name of the table we're going to create
table_name = "temp_transcriptions"

# Create the table
gbq_utils.create_temporary_table_in_gbq(
    dataframe=final_transcription_df,
    project_id=GBQ_PROJECT_ID,
    dataset_name=GBQ_DATASET_ID,
    table_name=table_name,
    if_exists="replace"
)

100%|██████████| 1/1 [00:00<?, ?it/s]


'neural-needledrop.backend_data.temp_transcriptions'

Now, with this temporary table in hand, I'm going to try and identify the videos whose transcriptions haven't been added yet. 

In [63]:
# The following query will determine which transcripts we need to upload
transcripts_to_upload_query = f"""
SELECT
  DISTINCT(temp_transcript.url)
FROM
  `backend_data.temp_transcriptions` temp_transcript
LEFT JOIN
  `backend_data.transcriptions` transcript
ON
  transcript.url = temp_transcript.url
WHERE
  transcript.created_at IS NULL
"""

# Execute the query
transcripts_to_upload_df = pd.read_gbq(
    transcripts_to_upload_query, project_id=GBQ_PROJECT_ID
)

Now that we've cross-referenced with the table, let's upload them. 

In [64]:
# Create a DataFrame containing the transcripts that we need to upload
final_transcriptions_to_upload_df = final_transcription_df.merge(
    transcripts_to_upload_df, on="url"
)

# Use the gbq_utils to add rows to the `backend_data.transcriptions` table
gbq_utils.add_rows_to_table(
    project_id=GBQ_PROJECT_ID,
    dataset_id=GBQ_DATASET_ID,
    table_id="transcriptions",
    rows=final_transcriptions_to_upload_df.to_dict(orient="records"),
)

Loaded 2500 row(s) into backend_data:transcriptions.


Finally, we'll delete the `temp_transcriptions` table and the `temp_data` directory.

In [67]:
# Delete the temporary table
gbq_utils.delete_table(
    project_id=GBQ_PROJECT_ID,
    dataset_id=GBQ_DATASET_ID,
    table_id=table_name,
)

# Delete the temp_data directory and everything in it
for child_file in Path("temp_data/").iterdir():
    child_file.unlink()
Path("temp_data/").rmdir()

Table backend_data:temp_transcriptions does not exist.
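
A small caveat on the directory cleanup: the `unlink()` loop only handles a flat directory of files, and `rmdir()` fails if anything is left behind. If `temp_data` ever ends up with subfolders, `shutil.rmtree` from the standard library could do the same job in one call. Here's a sketch against a throwaway directory (not the real `temp_data`):

```python
import shutil
import tempfile
from pathlib import Path

# Build a throwaway directory with a nested folder to stand in for temp_data/
scratch_dir = Path(tempfile.mkdtemp())
(scratch_dir / "nested").mkdir()
(scratch_dir / "nested" / "clip.mp3").write_bytes(b"")
(scratch_dir / "clip.mp3").write_bytes(b"")

# rmtree removes the directory and everything under it; with ignore_errors=True
# it also does nothing (rather than raising) if the directory is already gone
shutil.rmtree(scratch_dir, ignore_errors=True)
print(scratch_dir.exists())  # False
```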


# Enriching Video Metadata
The next part of the pipeline involves enriching the video data. 

# Embedding Transcriptions
Next up: we're going to embed some of the different audio transcriptions we've got. 