# **Finalizing Pipeline**
In **`02. Prototyping Data Pipeline`**, I scoped out the entire data pipeline. Once I knew that it was running properly, I wanted to make it more configurable and contained. 

This notebook follows the same steps as that one, but it calls a single configurable method for each stage instead of spreading each stage across several cells. 

Each method corresponds to one step of the pipeline. There are six of them: 

- **Initialize Cloud Resources:** This will make sure that all of the GBQ tables & GCS buckets exist. It'll have an optional attribute for deleting *everything*. 

- **Download Video Metadata:** This step identifies which videos still need to be collected, and then uses `pytube` to download their metadata. 

- **Enrich Video Metadata:** This step determines what type of video each video is (album review, weekly track roundup, etc.) and extracts review scores from the description. 

- **Downloading Video Audio:** This step downloads the audio of videos we haven't downloaded yet. 

- **Transcribing Audio:** Next: this step uses OpenAI's Whisper model to transcribe all of the audio we've downloaded. 

- **Embedding Transcriptions:** Finally, we're going to embed some of the transcriptions that we've created using Whisper. 

Each of these methods shares three key design principles: 

- Idempotency: The methods can be safely retried, and a re-run won't necessarily overwrite work that has already been done. 
- Logging: Each of the methods logs information (using various logging levels) to Google Cloud Logging. 
- Configurability: The parameters of each pipeline step can be set through command line arguments. I've also got a way to load these configurations from .yml files. 
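
To make the idempotency idea concrete, here is a minimal sketch of the pattern. It is not the code in the `jobs` modules: `run_example_step`, its parameters, and the logger name are all hypothetical, and the real jobs track state in GBQ and GCS rather than in memory. 

```python
import logging

# Hypothetical logger name, used for illustration only
logger = logging.getLogger("pipeline.example_step")


def run_example_step(all_video_ids, already_processed_ids, max_videos=10):
    """Process only the videos that have not been handled yet.

    Hypothetical stand-in for a pipeline job: re-running it with the
    updated state does no duplicate work, which makes it safe to retry.
    """
    done = set(already_processed_ids)

    # Keep the input order, skip anything already processed, apply the limit
    pending = [vid for vid in all_video_ids if vid not in done][:max_videos]
    logger.info("Found %d videos to process.", len(pending))

    for video_id in pending:
        # The real work (download, transcribe, embed, ...) would happen here
        done.add(video_id)

    return pending, done
```

Calling the function a second time with the returned `done` set yields an empty `pending` list, so a retry after a partial failure only picks up what is still missing. 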

# Setup
The cells below will set up the rest of the notebook.

I'll start by configuring the kernel: 

In [1]:
# Change the working directory 
%cd ..

# Enable the autoreload extension, which will automatically load in new code as it's written
%load_ext autoreload
%autoreload 2

d:\data\programming\neural-needledrop\pipeline


Now I'll import some necessary modules:

In [2]:
# Set up some environment variables to configure the logging 
%env LOG_TO_CONSOLE=True
%env LOG_LEVEL=INFO
%env TQDM_ENABLED=True

# General import statements
import pandas as pd

# Import each of the different jobs
from jobs.initialize_cloud_resources import run_initialize_cloud_resources_job
from jobs.download_video_metadata import run_download_video_metadata_job
from jobs.enrich_video_metadata import run_enrich_video_metadata_job
from jobs.download_audio import run_download_audio_job
from jobs.transcribe_audio import run_transcribe_audio_job
from jobs.embed_transcriptions import run_embed_transcriptions_job

env: LOG_TO_CONSOLE=True
env: LOG_LEVEL=INFO
env: TQDM_ENABLED=True
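
For context, this is roughly how a job module could turn those environment variables into a logger. It is a sketch under assumptions: `build_logger` is a hypothetical helper, the handling of `LOG_TO_CONSOLE` and `LOG_LEVEL` is my guess at their semantics, and the Google Cloud Logging handler is left out entirely. The format string is chosen to match the log lines shown in this notebook. 

```python
import logging
import os


def build_logger(name):
    """Build a logger from LOG_LEVEL and LOG_TO_CONSOLE (assumed semantics)."""
    level_name = os.environ.get("LOG_LEVEL", "INFO").upper()
    level = getattr(logging, level_name, logging.INFO)
    if not isinstance(level, int):
        # Fall back to INFO if the name does not map to a logging level
        level = logging.INFO

    to_console = os.environ.get("LOG_TO_CONSOLE", "False").lower() == "true"

    logger = logging.getLogger(name)
    logger.setLevel(level)

    # Attach a console handler at most once, so repeated calls stay safe
    has_stream_handler = any(
        isinstance(handler, logging.StreamHandler) for handler in logger.handlers
    )
    if to_console and not has_stream_handler:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)

    return logger
```

Because the variables are read when the logger is built, they need to be set before the job modules create their loggers, which is why the `%env` lines come first in the cell above. 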


# Running Jobs
Below, I'm going to run each of the individual jobs. 

### Initialize Cloud Resources


In [3]:
# Run the initialize resources job
run_initialize_cloud_resources_job()

2024-01-31 22:28:15,511 - pipeline.initialize_resources - INFO - Starting the INITIALIZE RESOURCES job.
2024-01-31 22:28:24,917 - pipeline.initialize_resources - INFO - Finished the INITIALIZE RESOURCES job.


### Downloading Video Metadata

In [4]:
# Run the download video metadata job
run_download_video_metadata_job(
    channel_url="https://www.youtube.com/c/theneedledrop",
    video_limit=10,
    stop_at_most_recent_video=True,
    video_parse_step_size=25,
    time_to_sleep_between_requests=2,
    sleep_time_multiplier=2.25,
    n_days_to_not_scrape=1,
)

2024-01-31 22:28:25,968 - pipeline.download_video_metadata - INFO - Determining whether or not to scrape this channel.
2024-01-31 22:28:29,702 - pipeline.download_video_metadata - INFO - Starting the DOWNLOAD VIDEO METADATA job.
2024-01-31 22:28:29,703 - pipeline.download_video_metadata - INFO - Crawling channel https://www.youtube.com/c/theneedledrop.
2024-01-31 22:28:31,895 - pipeline.download_video_metadata - INFO - Most recent video url found: https://www.youtube.com/watch?v=AXeLuZ9SUSM
2024-01-31 22:28:42,455 - pipeline.download_video_metadata - INFO - Found 1 videos to parse for channel https://www.youtube.com/c/theneedledrop.
100%|██████████| 1/1 [00:04<00:00,  4.59s/it]
2024-01-31 22:28:47,054 - pipeline.download_video_metadata - INFO - Finished parsing metadata for 1 videos.
2024-01-31 22:28:47,971 - pipeline.download_video_metadata - INFO - Finished adding rows to the `video_metadata` table.
2024-01-31 22:28:47,972 - pipeline.download_video_metadata - INFO - Finished the DOWN
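
The names `time_to_sleep_between_requests` and `sleep_time_multiplier` suggest a pause between requests that grows after a failure. How the job actually uses them isn't shown in this notebook, so the following is only a sketch under that assumption; `call_with_backoff` and all of its parameters are hypothetical. 

```python
import time


def call_with_backoff(fetch, base_sleep=2.0, multiplier=2.25, max_attempts=4, sleep=time.sleep):
    """Call fetch(); on failure, wait and retry with a growing delay.

    `sleep` is injectable so the waiting can be replaced in tests.
    """
    delay = base_sleep
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts:
                # Out of attempts: let the caller see the original error
                raise
            sleep(delay)
            # Assumed behaviour: each failure stretches the next wait
            delay *= multiplier
```

With the values passed above (2 seconds and a multiplier of 2.25), the waits after consecutive failures would be 2 seconds, then 4.5 seconds, and so on. 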

### Enrich Video Metadata


In [5]:
# Run the enrich video metadata job
run_enrich_video_metadata_job()

2024-01-31 22:28:48,073 - pipeline.enrich_video_metadata - INFO - Starting the ENRICH VIDEO METADATA job.
2024-01-31 22:28:50,137 - pipeline.enrich_video_metadata - INFO - Found 1 videos to enrich
2024-01-31 22:28:50,138 - pipeline.enrich_video_metadata - INFO - Adding `video_type` enrichment.
2024-01-31 22:28:50,140 - pipeline.enrich_video_metadata - INFO - Adding `review_score` enrichment.
2024-01-31 22:28:53,473 - pipeline.enrich_video_metadata - INFO - Finished the ENRICH VIDEO METADATA job.


### Downloading Audio

In [6]:
# Run the download audio job
run_download_audio_job(
    n_max_videos_to_download=1,
)

2024-01-31 22:28:53,572 - pipeline.download_audio - INFO - Starting the DOWNLOAD AUDIO job.
2024-01-31 22:28:58,550 - pipeline.download_audio - INFO - Downloading audio for 1 videos.
  0%|          | 0/1 [00:00<?, ?it/s]2024-01-31 22:29:09,867 - pipeline.download_audio - ERROR - Error downloading audio for https://www.youtube.com/watch?v=S69_SmKkBIY: 'IncompleteRead(2058102 bytes read, 426611 more expected)'
The traceback is:
Traceback (most recent call last):
  File "d:\data\programming\neural-needledrop\pipeline\jobs\download_audio.py", line 112, in run_download_audio_job
    youtube_utils.download_audio_from_video(
  File "d:\data\programming\neural-needledrop\pipeline\utils\youtube.py", line 109, in download_audio_from_video
    raise e
  File "d:\data\programming\neural-needledrop\pipeline\utils\youtube.py", line 88, in download_audio_from_video
    for stream in video.streams.filter(only_audio=True):
                  ^^^^^^^^^^^^^
  File "d:\data\programming\neural-needledrop\.v

FileNotFoundError: [WinError 3] The system cannot find the path specified: 'temp_audio_data'
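
The `FileNotFoundError` points at a relative `temp_audio_data` directory that did not exist when it was accessed. The traceback is cut off, so the exact cause isn't visible here. Assuming the job writes audio to a local scratch folder and cleans it up afterwards, one way to make that folder robust could look like this (`scratch_directory` is a hypothetical helper, not part of the pipeline): 

```python
import shutil
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def scratch_directory(path="temp_audio_data"):
    """Create a scratch directory, yield it, and always clean it up."""
    directory = Path(path)
    # No error if the directory already exists
    directory.mkdir(parents=True, exist_ok=True)
    try:
        yield directory
    finally:
        # ignore_errors avoids a second failure if the folder is already gone
        shutil.rmtree(directory, ignore_errors=True)
```

Wrapping the download in `with scratch_directory() as audio_dir:` would guarantee the folder exists before anything is written, and that cleanup doesn't raise a new error after a failed download such as the `IncompleteRead` above. 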

### Transcribing Audio

In [None]:
# Run the transcribe_audio job
run_transcribe_audio_job(
    n_max_to_transcribe=1,
)

### Embedding Transcriptions

In [7]:
# Run the embed_transcriptions job
run_embed_transcriptions_job(max_videos_to_embed=10, max_parallel_embedding_workers=3)

2024-01-31 22:29:29,938 - pipeline.embed_transcriptions - INFO - Found 773 individual transcription segments to embed.
2024-01-31 22:29:29,958 - pipeline.embed_transcriptions - INFO - Groupped the transcription segments into 261 segment chunks.
2024-01-31 22:29:29,959 - pipeline.embed_transcriptions - INFO - Embedding the transcription chunks...
Embedding Texts: 100%|██████████| 261/261 [00:37<00:00,  6.98it/s]
2024-01-31 22:30:07,386 - pipeline.embed_transcriptions - INFO - Finished embedding the transcription chunks.
2024-01-31 22:30:07,388 - pipeline.embed_transcriptions - INFO - Storing the embeddings in GCS...
100%|██████████| 261/261 [00:11<00:00, 22.90it/s]
