# **Finalizing Pipeline**
In **`02. Prototyping Data Pipeline`**, I scoped out the entire data pipeline. Once I knew that it was running properly, I wanted to make it more configurable and contained. 

This notebook mirrors that one, but instead of a multi-cell stretch for each section, it invokes a single configurable method per pipeline step.

Each method corresponds to one step of the pipeline. There are six of them:

- **Initialize Cloud Resources:** This will make sure that all of the GBQ tables & GCS buckets exist. It'll have an optional attribute for deleting *everything*. 

- **Download Video Metadata:** Next up: downloading some video metadata. This step identifies which videos still need to be fetched, and then uses `pytube` to download their metadata.

- **Enrich Video Metadata:** This step determines what type of video each video is (album review, weekly track roundup, etc.) and extracts review scores from the description.

- **Downloading Video Audio:** This step downloads the audio of videos we haven't downloaded yet.

- **Transcribing Audio:** Next: this step uses OpenAI's Whisper model to transcribe all of the audio we've downloaded. 

- **Embedding Transcriptions:** Finally, we're going to embed some of the transcriptions that we've created using Whisper. 

Each of these methods shares a few key design principles:

- Idempotent: The methods can be retried, and won't necessarily overwrite things that already exist.
- Logged: Each of the methods logs information (at various logging levels) to Google Cloud Logging.
- Configurable: The parameters of each pipeline step can be set through command line arguments. I've also got a way to load these configurations from .yml files.
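To make those three principles concrete, here is a minimal sketch of what a step with this shape could look like. It is not the code in `jobs/`: `run_example_job`, the `--max-items` flag, and the in-memory `done` set (a stand-in for checking GBQ or GCS) are all hypothetical, and the sketch logs through the standard `logging` module rather than Google Cloud Logging.

```python
import argparse
import logging

logger = logging.getLogger("pipeline_sketch")


def run_example_job(pending_ids, already_done, max_items=None):
    """Process only the IDs that have not been handled yet (hypothetical job)."""
    # Idempotency: skip anything a previous run already finished
    todo = [item_id for item_id in pending_ids if item_id not in already_done]
    if max_items is not None:
        todo = todo[:max_items]

    # Logging: report progress at different levels
    logger.info("Found %d item(s) to process", len(todo))
    for item_id in todo:
        logger.debug("Processing %s", item_id)
        already_done.add(item_id)  # stand-in for writing a result to storage
    return todo


def parse_args(argv=None):
    """Configurability: expose the step's parameters as command line flags."""
    parser = argparse.ArgumentParser(description="Hypothetical pipeline step")
    parser.add_argument("--max-items", type=int, default=None)
    return parser.parse_args(argv)


# Re-running the same step picks up where the last run stopped
done = set()
args = parse_args(["--max-items", "2"])
first_run = run_example_job(["a", "b", "c"], done, max_items=args.max_items)
second_run = run_example_job(["a", "b", "c"], done, max_items=args.max_items)
third_run = run_example_job(["a", "b", "c"], done, max_items=args.max_items)
```

The third call finds nothing left to do, which is the property that makes a failed run safe to retry.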

# Setup
The cells below will set up the rest of the notebook.

I'll start by configuring the kernel: 

In [1]:
# Change the working directory 
%cd ..

# Enable the autoreload extension, which will automatically load in new code as it's written
%load_ext autoreload
%autoreload 2

d:\data\programming\neural-needledrop\pipeline


Now I'll import some necessary modules:

In [2]:
# Set up some environment variables to configure the logging 
%env LOG_TO_CONSOLE=True
%env LOG_LEVEL=DEBUG
%env TQDM_ENABLED=True

# General import statements
import pandas as pd

# Import each of the different jobs
from jobs.initialize_cloud_resources import run_initialize_cloud_resources_job
from jobs.download_video_metadata import run_download_video_metadata_job
from jobs.enrich_video_metadata import run_enrich_video_metadata_job
from jobs.download_audio import run_download_audio_job
from jobs.transcribe_audio import run_transcribe_audio_job
from jobs.embed_transcriptions import run_embed_transcriptions_job

env: LOG_TO_CONSOLE=True
env: LOG_LEVEL=DEBUG
env: TQDM_ENABLED=True
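For context, this is roughly how a module could turn `LOG_LEVEL` and `LOG_TO_CONSOLE` into a configured logger. The real logging setup lives inside the pipeline code and isn't shown in this notebook, so `configure_logger` and the way it interprets the two variables are assumptions; the sketch also ignores Google Cloud Logging entirely.

```python
import logging
import os


def configure_logger(name="pipeline_sketch"):
    """Build a logger from LOG_LEVEL and LOG_TO_CONSOLE (assumed semantics)."""
    # Fall back to INFO if the variable is missing or not a known level name
    level_name = os.environ.get("LOG_LEVEL", "INFO").upper()
    level = getattr(logging, level_name, logging.INFO)
    to_console = os.environ.get("LOG_TO_CONSOLE", "False").lower() == "true"

    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.handlers.clear()  # avoid duplicate handlers when the cell is re-run
    if to_console:
        logger.addHandler(logging.StreamHandler())
    return logger


# Same values as the %env lines above, set here so the sketch runs on its own
os.environ["LOG_LEVEL"] = "DEBUG"
os.environ["LOG_TO_CONSOLE"] = "True"
sketch_logger = configure_logger()
sketch_logger.debug("Logger configured at %s", logging.getLevelName(sketch_logger.level))
```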


# Running Jobs
Below, I'm going to run each of the individual jobs. 

### Initialize Cloud Resources


In [None]:
# Run the initialize resources job
run_initialize_cloud_resources_job()

### Downloading Video Metadata

In [None]:
# Run the download video metadata job
run_download_video_metadata_job(
    channel_url="https://www.youtube.com/c/theneedledrop",
    video_limit=10,
    stop_at_most_recent_video=True,
    video_parse_step_size=25,
    time_to_sleep_between_requests=2,
    sleep_time_multiplier=2.25,
    n_days_to_not_scrape=1,
)
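The job's internals aren't shown here, so the following is only a sketch of one plausible reading of `time_to_sleep_between_requests` and `sleep_time_multiplier`: wait a base amount after a failed request, and multiply that wait after each further failure. `fetch_with_backoff`, `max_attempts`, and the injectable `sleep` argument are hypothetical names, not part of the job's signature.

```python
import time


def fetch_with_backoff(fetch, max_attempts=4, time_to_sleep_between_requests=2,
                       sleep_time_multiplier=2.25, sleep=time.sleep):
    """Call fetch() until it succeeds, sleeping longer after each failure."""
    wait = time_to_sleep_between_requests
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(wait)
            wait *= sleep_time_multiplier  # grow the wait for the next failure


# Usage with a fake fetch that fails twice; sleeps are recorded, not performed
calls = {"n": 0}
waits = []


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "metadata"


result = fetch_with_backoff(flaky_fetch, sleep=waits.append)
```

With the values passed above, the recorded waits are 2 seconds and then 4.5 seconds before the third attempt succeeds.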

### Enrich Video Metadata


In [None]:
# Run the enrich video metadata job
run_enrich_video_metadata_job()
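As a rough illustration of the enrichment step, the sketch below pulls an `N/10` score out of a description and guesses a video type from the title. The regular expression, the type labels, and both function names are assumptions for illustration; the actual job may use different rules.

```python
import re

# Matches scores written like "8/10" or "10 / 10" (assumed description format)
SCORE_PATTERN = re.compile(r"\b(\d{1,2})\s*/\s*10\b")


def extract_review_score(description):
    """Return the first 'N/10' score found in a description, or None."""
    match = SCORE_PATTERN.search(description)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 0 <= score <= 10 else None


def classify_video_type(title):
    """Guess the video type from keywords in the title (hypothetical labels)."""
    lowered = title.lower()
    if "album review" in lowered:
        return "album_review"
    if "weekly track roundup" in lowered:
        return "weekly_track_roundup"
    return "other"


example_score = extract_review_score("Solid record overall.\n\n8/10\n\nFav tracks: ...")
example_type = classify_video_type("Some Artist - Some Album ALBUM REVIEW")
```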

### Downloading Audio

In [None]:
# Run the download audio job
run_download_audio_job(
    n_max_videos_to_download=1,
)

### Transcribing Audio

In [None]:
# Run the transcribe_audio job
run_transcribe_audio_job(
    n_max_to_transcribe=1,
)

### Embedding Transcriptions

In [None]:
# Run the embed_transcriptions job
run_embed_transcriptions_job(max_videos_to_embed=700)

In [40]:
from utils.openai import embed_text_list, embed_text

In [41]:
embs = embed_text_list(["this is some text", "this is another piece of text, which should have a different embedding"])

Embedding Texts: 100%|██████████| 2/2 [00:00<00:00,  2.50it/s]


In [42]:
embs[0] == embs[1]

False
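An equality check only shows that the two vectors aren't identical. A more informative comparison is cosine similarity, sketched below in plain Python. This assumes `embed_text_list` returns each embedding as a list of floats (consistent with `==` returning a single `False` above); the toy vectors are made up so the block runs without calling the API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings
toy_a = [1.0, 0.0]
toy_b = [0.0, 1.0]  # orthogonal to toy_a
toy_c = [2.0, 0.0]  # same direction as toy_a, different length

same_direction = cosine_similarity(toy_a, toy_c)
orthogonal = cosine_similarity(toy_a, toy_b)
```

With real embeddings, the call would be `cosine_similarity(embs[0], embs[1])`.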

In [None]:
emb_1 = 