# **Finalizing Pipeline**
In **`02. Prototyping Data Pipeline`**, I scoped out the entire data pipeline. Once I knew that it was running properly, I wanted to make it more configurable and self-contained.

This notebook follows the same flow as that one, but each section now invokes a single configurable pipeline method instead of a multi-cell stretch of code.

Each method corresponds to one step of the pipeline. There are six:

- **Initialize Cloud Resources:** This makes sure that all of the GBQ tables and GCS buckets exist. It has an option for deleting *everything*.

- **Download Video Metadata:** This identifies which videos still need to be found, then uses `pytube` to download their metadata.

- **Enrich Video Metadata:** This step determines what type each video is (album review, weekly track roundup, etc.) and extracts review scores from the description.

- **Download Audio:** This step downloads the audio of videos we haven't downloaded yet.

- **Transcribe Audio:** This step uses OpenAI's Whisper model to transcribe all of the audio we've downloaded.

- **Embed Transcriptions:** Finally, this step embeds some of the transcriptions that we've created using Whisper.

Each of these methods shares a few key design principles:

- Idempotent: Each method can be retried, and a rerun won't necessarily overwrite existing results.
- Logged: Each method logs information (at various logging levels) to Google Cloud Logging.
- Configurable: Each step's parameters can be set through command line arguments. I've also got a way to load these configurations from `.yml` files.
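
As a rough illustration of these three properties, here is a minimal sketch of what one pipeline step might look like. It is not the actual job code: `run_example_job`, `already_processed`, `--n-max-items`, and the in-memory set standing in for storage are all hypothetical (the real jobs read from and write to GBQ and GCS), and only the standard library is used.

```python
import argparse
import logging

logger = logging.getLogger("pipeline.example_job")  # hypothetical logger name


def run_example_job(candidate_ids, already_processed, n_max_items=None):
    """Process only the IDs that haven't been handled yet, so reruns are safe."""
    logger.info("Starting the EXAMPLE job.")

    # Idempotency: skip anything a previous run already stored
    to_process = [i for i in candidate_ids if i not in already_processed]
    if n_max_items is not None:
        to_process = to_process[:n_max_items]

    if not to_process:
        logger.info("Nothing to process. Exiting the job...")
        return []

    for item_id in to_process:
        already_processed.add(item_id)  # stand-in for a GBQ / GCS write

    logger.info("Finished the EXAMPLE job (%d items).", len(to_process))
    return to_process


def parse_args(argv=None):
    """Configurable: expose the step's parameters as command line arguments."""
    parser = argparse.ArgumentParser(description="Hypothetical pipeline step")
    parser.add_argument("--n-max-items", type=int, default=None)
    return parser.parse_args(argv)
```

Calling `run_example_job(["a", "b"], store)` twice with the same `store` does the work once and exits early the second time, which is the behaviour the log output later in this notebook shows for steps with nothing left to do.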

# Setup
The cells below will set up the rest of the notebook.

I'll start by configuring the kernel: 

In [1]:
# Change the working directory 
%cd ..

# Enable the autoreload extension, which will automatically load in new code as it's written
%load_ext autoreload
%autoreload 2

d:\data\programming\neural-needledrop\pipeline


Now I'll import some necessary modules:

In [2]:
# Set up some environment variables to configure the logging 
%env LOG_TO_CONSOLE=True
%env LOG_LEVEL=INFO
%env TQDM_ENABLED=True

# General import statements
import pandas as pd

# Import each of the different jobs
from jobs.initialize_cloud_resources import run_initialize_cloud_resources_job
from jobs.download_video_metadata import run_download_video_metadata_job
from jobs.enrich_video_metadata import run_enrich_video_metadata_job
from jobs.download_audio import run_download_audio_job
from jobs.transcribe_audio import run_transcribe_audio_job
from jobs.embed_transcriptions import run_embed_transcriptions_job

env: LOG_TO_CONSOLE=True
env: LOG_LEVEL=INFO
env: TQDM_ENABLED=True
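
How the jobs consume these variables isn't shown in this notebook. One plausible sketch is below, assuming `LOG_LEVEL` names a standard `logging` level and `LOG_TO_CONSOLE` is the string `True` or `False`. `build_logger` is a hypothetical helper, and the Google Cloud Logging side of the real pipeline is left out.

```python
import logging
import os
import sys


def build_logger(name):
    """Hypothetical helper: configure a logger from LOG_LEVEL / LOG_TO_CONSOLE."""
    logger = logging.getLogger(name)

    # Fall back to INFO if the variable is unset or not a recognised level name
    level_name = os.environ.get("LOG_LEVEL", "INFO").upper()
    level = getattr(logging, level_name, None)
    logger.setLevel(level if isinstance(level, int) else logging.INFO)

    # Only attach a console handler when asked to, and only once per logger
    log_to_console = os.environ.get("LOG_TO_CONSOLE", "False").lower() == "true"
    if log_to_console and not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)

    return logger
```

The format string is chosen to match the shape of the log lines printed by the jobs below (timestamp, logger name, level, message).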


# Running Jobs
Below, I'm going to run each of the individual jobs. 

### Initialize Cloud Resources


In [3]:
# Run the initialize resources job
run_initialize_cloud_resources_job()

2024-01-26 10:50:44,542 - pipeline.initialize_resources - INFO - Starting the INITIALIZE RESOURCES job.
2024-01-26 10:50:49,668 - pipeline.initialize_resources - INFO - Finished the INITIALIZE RESOURCES job.


### Download Video Metadata

In [4]:
# Run the download video metadata job
run_download_video_metadata_job(
    channel_url="https://www.youtube.com/c/theneedledrop",
    video_limit=10,
    stop_at_most_recent_video=True,
    video_parse_step_size=25,
    time_to_sleep_between_requests=2,
    sleep_time_multiplier=2.25,
)

2024-01-26 10:50:49,774 - pipeline.download_video_metadata - INFO - Starting the DOWNLOAD VIDEO METADATA job.
2024-01-26 10:50:49,775 - pipeline.download_video_metadata - INFO - Crawling channel https://www.youtube.com/c/theneedledrop.
  most_recent_video_url_df = pd.read_gbq(
2024-01-26 10:50:53,651 - pipeline.download_video_metadata - INFO - Most recent video url found: https://www.youtube.com/watch?v=OelpOL9bLTY
  dataframe.to_gbq(
  actual_videos_to_parse_df = pd.read_gbq(
2024-01-26 10:51:01,871 - pipeline.download_video_metadata - INFO - There aren't any videos to parse for channel https://www.youtube.com/c/theneedledrop. Exiting.
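
The notebook doesn't show how `time_to_sleep_between_requests` and `sleep_time_multiplier` are used inside the job. One common reading is a multiplicative backoff, where the wait grows by the multiplier after each failed request. The sketch below illustrates that pattern under this assumption; `fetch_with_backoff`, `max_attempts`, and the injectable `sleep` function are hypothetical and not part of the pipeline.

```python
import time


def fetch_with_backoff(fetch, time_to_sleep=2.0, sleep_time_multiplier=2.25,
                       max_attempts=4, sleep=time.sleep):
    """Call fetch() until it succeeds, waiting longer after each failure."""
    wait = time_to_sleep
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts:
                raise  # out of retries: surface the last error
            sleep(wait)
            wait *= sleep_time_multiplier  # the wait grows after every failure
```

With the values passed above (2 and 2.25), successive waits under this scheme would be 2 s, 4.5 s, and 10.125 s. Passing `sleep` as a parameter keeps the function easy to test without actually waiting.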


### Enrich Video Metadata


In [5]:
# Run the enrich video metadata job
run_enrich_video_metadata_job()

2024-01-26 10:51:01,972 - pipeline.enrich_video_metadata - INFO - Starting the ENRICH VIDEO METADATA job.
  videos_to_enrich_df = pd.read_gbq(videos_to_enrich_query, project_id=GBQ_PROJECT_ID)
2024-01-26 10:51:03,535 - pipeline.enrich_video_metadata - INFO - No videos to enrich. Exiting the job...


### Download Audio

In [6]:
# Run the download audio job
run_download_audio_job(
    n_max_videos_to_download=1,
)

2024-01-26 10:51:03,634 - pipeline.download_audio - INFO - Starting the DOWNLOAD AUDIO job.
  videos_for_audio_parsing_df = pd.read_gbq(
2024-01-26 10:51:07,260 - pipeline.download_audio - INFO - No videos to download audio for. Exiting job...


### Transcribe Audio

In [7]:
# Run the transcribe_audio job
run_transcribe_audio_job(
    n_max_to_transcribe=1,
)

2024-01-26 10:51:07,712 - pipeline.transcribe_audio - INFO - Starting the TRANSCRIBE AUDIO job.
  videos_for_transcription_df = pd.read_gbq(
2024-01-26 10:51:11,358 - pipeline.transcribe_audio - INFO - Found 1 videos whose audio we need to transcribe.
100%|██████████| 1/1 [00:01<00:00,  1.97s/it]
2024-01-26 10:51:13,330 - pipeline.transcribe_audio - INFO - Finished downloading the audio of the videos we need to transcribe.
2024-01-26 10:51:13,330 - pipeline.transcribe_audio - INFO - Starting the transcription process.
100%|██████████| 1/1 [00:40<00:00, 40.89s/it]
2024-01-26 10:51:54,219 - pipeline.transcribe_audio - INFO - Finished transcribing the audio.
  dataframe.to_gbq(
  transcripts_to_upload_df = pd.read_gbq(
2024-01-26 10:52:03,386 - pipeline.transcribe_audio - INFO - Finished the TRANSCRIBE AUDIO job.


### Embed Transcriptions

In [8]:
# Run the embed_transcriptions job
run_embed_transcriptions_job(max_videos_to_embed=700)

  transcriptions_to_embed_df = pd.read_gbq(
2024-01-26 10:52:07,507 - pipeline.embed_transcriptions - INFO - Found 118 individual transcription segments to embed.
2024-01-26 10:52:07,513 - pipeline.embed_transcriptions - INFO - Groupped the transcription segments into 45 segment chunks.
2024-01-26 10:52:07,514 - pipeline.embed_transcriptions - INFO - Embedding the transcription chunks...
Embedding Texts: 100%|██████████| 45/45 [00:02<00:00, 18.21it/s]
2024-01-26 10:52:09,993 - pipeline.embed_transcriptions - INFO - Finished embedding the transcription chunks.
2024-01-26 10:52:09,994 - pipeline.embed_transcriptions - INFO - Storing the embeddings in GCS...
100%|██████████| 45/45 [00:02<00:00, 17.47it/s]
