# **Finalizing Pipeline**
In **`02. Prototyping Data Pipeline`**, I scoped out the entire data pipeline. Once I knew that it was running properly, I wanted to make it more configurable and contained. 

This notebook mirrors that one, but each section calls a single configurable pipeline method instead of running a multi-cell stretch of code. 

Each method corresponds to one step of the pipeline: 

- **Initialize Cloud Resources:** This will make sure that all of the GBQ tables & GCS buckets exist. It'll have an optional attribute for deleting *everything*. 

- **Download Video Metadata:** This step identifies which videos still need to be fetched, and then uses `pytube` to download their metadata. 

- **Enrich Video Metadata:** This step determines what type of video each video is (album review, weekly track roundup, etc.), and extracts review scores from the description.

- **Downloading Video Audio:** This step downloads the audio of videos we haven't downloaded yet.

- **Transcribing Audio:** This step uses OpenAI's Whisper model to transcribe all of the audio we've downloaded. 

- **Embedding Transcriptions:** Finally, we're going to embed some of the transcriptions that we've created using Whisper. 

Each of these methods shares a few key design principles: 

- Idempotency: The methods can be retried safely. A re-run picks up only the work that hasn't been done yet instead of overwriting existing results.
- Logging: Each of the methods logs information (using various logging levels) to Google Cloud Logging. 
- Configurability: The parameters of each pipeline step can be set through command line arguments. I've also got a way to load these configurations from .yml files. 
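
To make the idempotency and configuration points a bit more concrete, here's a minimal sketch of the two patterns. This isn't the actual job code: `select_unprocessed`, `resolve_config`, the video IDs, and the rule that command line values win over file values are all assumptions for illustration. A plain dict stands in for the parsed .yml file.

```python
import argparse


def select_unprocessed(candidate_ids, already_processed_ids, limit=None):
    """Return the candidate IDs that haven't been processed yet, in order.

    Re-running a job built around this filter skips finished work,
    which is what makes the retry safe.
    """
    done = set(already_processed_ids)
    todo = [video_id for video_id in candidate_ids if video_id not in done]
    return todo if limit is None else todo[:limit]


def resolve_config(file_config, argv):
    """Merge file-based settings with command line overrides.

    Assumption: a value passed on the command line takes precedence
    over the same key loaded from the config file.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--channel_url", type=str, default=None)
    parser.add_argument("--video_limit", type=int, default=None)
    args = parser.parse_args(argv)
    cli_overrides = {k: v for k, v in vars(args).items() if v is not None}
    return {**file_config, **cli_overrides}


# Hypothetical IDs: two videos are already done, two are still pending.
all_ids = ["a1", "b2", "c3", "d4"]
processed = ["a1", "b2"]

first_run = select_unprocessed(all_ids, processed, limit=1)
processed.extend(first_run)
second_run = select_unprocessed(all_ids, processed, limit=1)

config = resolve_config(
    {"channel_url": "https://www.youtube.com/c/theneedledrop", "video_limit": 10},
    ["--video_limit", "3"],
)
```

Each run only touches the next unprocessed ID, so retrying after a crash never redoes finished work. In the config example, `video_limit` ends up as 3 while `channel_url` keeps the file's value.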

# Setup
The cells below will set up the rest of the notebook.

I'll start by configuring the kernel: 

In [1]:
# Change the working directory 
%cd ..

# Enable the autoreload extension, which will automatically load in new code as it's written
%load_ext autoreload
%autoreload 2

d:\data\programming\neural-needledrop\pipeline


Now I'll import some necessary modules:

In [2]:
# Set up some environment variables to configure the logging 
%env LOG_TO_CONSOLE=True
%env LOG_LEVEL=INFO
%env TQDM_ENABLED=True

# General import statements
import pandas as pd

# Import each of the different jobs
from jobs.initialize_cloud_resources import run_initialize_cloud_resources_job
from jobs.download_video_metadata import run_download_video_metadata_job
from jobs.enrich_video_metadata import run_enrich_video_metadata_job
from jobs.download_audio import run_download_audio_job
from jobs.transcribe_audio import run_transcribe_audio_job
from jobs.embed_transcriptions import run_embed_transcriptions_job

env: LOG_TO_CONSOLE=True
env: LOG_LEVEL=INFO
env: TQDM_ENABLED=True
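
For context, here's a rough sketch of how a logger could pick up the `LOG_LEVEL` and `LOG_TO_CONSOLE` variables set above. The `build_logger` helper is hypothetical and only covers console output; the real pipeline also ships logs to Google Cloud Logging, which this sketch leaves out.

```python
import logging
import os
import sys


def build_logger(name, env=os.environ):
    """Create a logger whose level and console output are driven by env vars."""
    logger = logging.getLogger(name)

    # Fall back to INFO if LOG_LEVEL is missing or not a known level name.
    level_name = str(env.get("LOG_LEVEL", "INFO")).upper()
    level = getattr(logging, level_name, logging.INFO)
    logger.setLevel(level if isinstance(level, int) else logging.INFO)

    # Only attach a console handler once, so repeated calls don't duplicate output.
    log_to_console = str(env.get("LOG_TO_CONSOLE", "False")).lower() == "true"
    if log_to_console and not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
    return logger


# Hypothetical logger name; the env dict mirrors the %env lines above.
example_logger = build_logger(
    "pipeline.example", env={"LOG_LEVEL": "INFO", "LOG_TO_CONSOLE": "True"}
)
example_logger.info("Starting the EXAMPLE job.")
example_logger.debug("This line is filtered out at the INFO level.")
```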


# Running Jobs
Below, I'm going to run each of the individual jobs. 

### Initialize Cloud Resources


In [3]:
# Run the initialize resources job
run_initialize_cloud_resources_job()

2024-01-20 17:22:13,178 - pipeline.initialize_resources - INFO - Starting the INITIALIZE RESOURCES job.
2024-01-20 17:22:16,913 - pipeline.initialize_resources - INFO - Finished the INITIALIZE RESOURCES job.


### Downloading Video Metadata

In [4]:
# Run the download video metadata job
run_download_video_metadata_job(
    channel_url="https://www.youtube.com/c/theneedledrop",
    video_limit=10,
    stop_at_most_recent_video=True,
    video_parse_step_size=25,
    time_to_sleep_between_requests=2,
    sleep_time_multiplier=2.25,
)

2024-01-20 17:22:17,009 - pipeline.download_video_metadata - INFO - Starting the DOWNLOAD VIDEO METADATA job.
2024-01-20 17:22:17,010 - pipeline.download_video_metadata - INFO - Crawling channel https://www.youtube.com/c/theneedledrop.
2024-01-20 17:22:20,028 - pipeline.download_video_metadata - INFO - Most recent video url found: https://www.youtube.com/watch?v=74NdKIdbXJc
2024-01-20 17:22:26,400 - pipeline.download_video_metadata - INFO - There aren't any videos to parse for channel https://www.youtube.com/c/theneedledrop. Exiting.
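
The `time_to_sleep_between_requests` and `sleep_time_multiplier` arguments suggest some kind of backoff between requests. I haven't shown the job's internals here, so the following is just one plausible reading: wait a base amount of time after a failed request, and multiply that wait after each further failure. The `fetch_with_backoff` helper and the flaky fetch function are hypothetical.

```python
import time


def fetch_with_backoff(fetch, initial_sleep=2.0, multiplier=2.25,
                       max_attempts=4, sleep=time.sleep):
    """Call `fetch` until it succeeds, growing the wait after each failure."""
    wait = initial_sleep
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception:
            # Give up and re-raise once the attempts are exhausted.
            if attempt == max_attempts:
                raise
            sleep(wait)
            wait *= multiplier


# Simulate a request that fails twice before succeeding.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated rate limit")
    return "video metadata"


# Record the waits instead of actually sleeping, to keep the example fast.
recorded_sleeps = []
result = fetch_with_backoff(flaky_fetch, sleep=recorded_sleeps.append)
```

With the values from the cell above (2 seconds, multiplier 2.25), the two failures lead to waits of 2.0 and then 4.5 seconds before the third attempt succeeds.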


### Enrich Video Metadata


In [5]:
# Run the enrich video metadata job
run_enrich_video_metadata_job()

2024-01-20 17:22:26,503 - pipeline.enrich_video_metadata - INFO - Starting the ENRICH VIDEO METADATA job.
2024-01-20 17:22:27,845 - pipeline.enrich_video_metadata - INFO - No videos to enrich. Exiting the job...


### Downloading Audio

In [6]:
# Run the download audio job
run_download_audio_job(
    n_max_videos_to_download=1,
)

2024-01-20 17:22:27,945 - pipeline.download_audio - INFO - Starting the DOWNLOAD AUDIO job.
2024-01-20 17:22:31,242 - pipeline.download_audio - INFO - Downloading audio for 1 videos.
100%|██████████| 1/1 [00:18<00:00, 18.96s/it]
100%|██████████| 1/1 [00:09<00:00,  9.82s/it]


### Transcribing Audio

In [7]:
# Run the transcribe_audio job
run_transcribe_audio_job(
    n_max_to_transcribe=1,
)

2024-01-20 17:23:00,529 - pipeline.transcribe_audio - INFO - Starting the TRANSCRIBE AUDIO job.
2024-01-20 17:23:03,723 - pipeline.transcribe_audio - INFO - Found 1 videos whose audio we need to transcribe.
100%|██████████| 1/1 [00:01<00:00,  1.20s/it]
2024-01-20 17:23:04,925 - pipeline.transcribe_audio - INFO - Finished downloading the audio of the videos we need to transcribe.
2024-01-20 17:23:04,926 - pipeline.transcribe_audio - INFO - Starting the transcription process.
100%|██████████| 1/1 [01:10<00:00, 70.62s/it]
2024-01-20 17:24:15,551 - pipeline.transcribe_audio - INFO - Finished transcribing the audio.
2024-01-20 17:24:22,645 - pipeline.transcribe_audio - INFO - Finished the TRANSCRIBE AUDIO job.


### Embedding Transcriptions

In [8]:
# Run the embed_transcriptions job
run_embed_transcriptions_job()

2024-01-20 17:24:26,070 - pipeline.embed_transcriptions - INFO - Found 266 individual transcription segments to embed.
2024-01-20 17:24:26,077 - pipeline.embed_transcriptions - INFO - Groupped the transcription segments into 101 segment chunks.
2024-01-20 17:24:26,077 - pipeline.embed_transcriptions - INFO - Embedding the transcription chunks...
Embedding Texts: 100%|██████████| 101/101 [00:02<00:00, 35.02it/s]
2024-01-20 17:24:28,986 - pipeline.embed_transcriptions - INFO - Finished embedding the transcription chunks.
2024-01-20 17:24:28,986 - pipeline.embed_transcriptions - INFO - Storing the embeddings in GCS...
100%|██████████| 101/101 [00:01<00:00, 63.75it/s]
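
The log above shows the transcription segments being grouped into a smaller number of chunks before embedding. As an illustration of the idea, here's a minimal sliding-window sketch. The `chunk_segments` helper, the window size, and the stride are assumptions of mine, not the values or the exact strategy the job uses.

```python
def chunk_segments(segments, window_size=5, stride=3):
    """Group consecutive segment texts into overlapping chunks.

    Each chunk joins up to `window_size` segments, and consecutive
    chunks start `stride` segments apart.
    """
    chunks = []
    for start in range(0, len(segments), stride):
        window = segments[start:start + window_size]
        chunks.append(" ".join(window))
        # Stop once a window has reached the end of the segment list.
        if start + window_size >= len(segments):
            break
    return chunks


# Hypothetical segment texts standing in for Whisper output.
example_segments = ["a", "b", "c", "d", "e", "f", "g"]
example_chunks = chunk_segments(example_segments, window_size=3, stride=2)
# example_chunks == ["a b c", "c d e", "e f g"]
```

Overlapping windows give each embedding a bit of surrounding context, at the cost of embedding some text more than once.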
