# Motivation
Now that I've played with both Pytube and OpenAI's Whisper, I should be able to start building a pipeline that downloads the audio, details, and transcription for Anthony Fantano's videos.

# Setup
The cells below will help to set up the rest of the notebook.

I'll start by changing directories to the root of the repo. 

In [1]:
# Change the directory to the root of the repo
%cd ..

C:\Data\Personal Study\Programming\neural-needle-drop


Next, I'll import a couple of different libraries.

In [2]:
# Import statements
import subprocess
import whisper
from pathlib import Path
from time import time
import torch
import json
from Levenshtein import ratio
from tqdm import tqdm
from time import sleep
from random import randint
import os

# pytube-specific import statements
from pytube import YouTube
from pytube import Channel

Finally, I'll set up Whisper by loading the model.

In [3]:
# Determining whether we'll use GPU or CPU
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load in the model
model_type = "tiny.en"
start_time = time()
whisper_model = whisper.load_model(model_type, device=DEVICE)
print(f"Loaded the '{model_type}' model into {DEVICE} in {time()-start_time:.2f} seconds")

Loaded the 'tiny.en' model into cuda in 1.84 seconds


# Methods
Below, I'll build up a few functions that'll be useful for the pipeline. If I eventually move this out of a Jupyter Notebook, it'd be worth having this code split into proper modules.

In [4]:
# This method will retrieve all of the video URLs for a particular channel
def get_channel_video_urls(channel):
    
    # Creating a Channel object
    channel_object = Channel(f'https://www.youtube.com/c/{channel}')

    # Grab the URLs of all of the channel's videos
    all_video_urls = channel_object.video_urls
    return all_video_urls

# This method will convert an m4a file to mp3
def convert_m4a_to_mp3(input_file_path, output_file_path):
    
    # Build the ffmpeg command as a list of arguments, so that paths
    # containing spaces are passed through intact on any OS
    command = ["ffmpeg", "-i", str(input_file_path), str(output_file_path)]

    # Run the command, raising a CalledProcessError if ffmpeg fails
    subprocess.run(command, check=True)

In [5]:
# This method will attempt to download the linked YouTube video and save an .mp3 at a particular path
def download_video_mp3(video_url, output_file_path):
    
    # Creating a YouTube object for this video
    video = YouTube(video_url)

    # Find the highest-bitrate .m4a audio stream
    highest_bitrate_mp4_audio_stream = None
    highest_bitrate_found = 0
    for stream in video.streams.filter(only_audio=True):
        if (stream.mime_type == "audio/mp4"):
            stream_bitrate = int(stream.abr.split("kbps")[0])
            if (stream_bitrate > highest_bitrate_found):
                highest_bitrate_mp4_audio_stream = stream
                highest_bitrate_found = stream_bitrate

    # Create a data/ folder if it doesn't exist
    data_folder_path = Path(output_file_path.parent)
    data_folder_path.mkdir(exist_ok=True, parents=True)

    # Bail out early if the video has no audio/mp4 stream
    if highest_bitrate_mp4_audio_stream is None:
        raise ValueError(f"No audio/mp4 stream found for {video_url}")

    # Download the audio for this YouTube video
    m4a_file_path = data_folder_path / f"{output_file_path.stem}.m4a"
    highest_bitrate_mp4_audio_stream.download(output_path=str(data_folder_path),
                                              filename=m4a_file_path.name,
                                              skip_existing=False)

    # Now, convert this audio to .mp3
    convert_m4a_to_mp3(input_file_path=m4a_file_path,
                       output_file_path=output_file_path)

    # Delete the temporary .m4a file
    os.remove(m4a_file_path)
    
# This method will return a dictionary of a particular video's video_details
def get_video_details(video_url):
    
    # Convert the video to a YouTube object
    video = YouTube(video_url)
    
    # Get the details for this video
    video_details = video.vid_info["videoDetails"]
    
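    # Add the publication date to the video details as an ISO-format string,
    # since json.dump can't serialize a datetime object
    publish_date = video.publish_date
    video_details["publish_date"] = publish_date.isoformat() if publish_date else None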
    
    # Return the video details
    return video_details
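
The bitrate search in `download_video_mp3` can also be written as a single `max` call. Here's a minimal sketch of that idea. It works on any objects with `mime_type` and `abr` attributes; the `FakeStream` tuple is a hypothetical stand-in for pytube's stream objects so that the snippet runs without network access, and the helper names are my own. Unlike the loop above, it returns `None` when no `audio/mp4` stream exists and treats an unparseable bitrate as 0.

```python
from collections import namedtuple

# Hypothetical stand-in for a pytube stream; only the two attributes used here
FakeStream = namedtuple("FakeStream", ["mime_type", "abr"])

def parse_kbps(abr):
    # "128kbps" -> 128; a missing or malformed value counts as 0
    try:
        return int(abr.split("kbps")[0])
    except (AttributeError, ValueError):
        return 0

def pick_highest_bitrate_mp4_audio(streams):
    # Keep only the audio/mp4 (.m4a) streams
    candidates = [s for s in streams if s.mime_type == "audio/mp4"]
    # default=None avoids a ValueError when nothing matches
    return max(candidates, key=lambda s: parse_kbps(s.abr), default=None)

example_streams = [
    FakeStream("audio/mp4", "48kbps"),
    FakeStream("audio/webm", "160kbps"),
    FakeStream("audio/mp4", "128kbps"),
]
best = pick_highest_bitrate_mp4_audio(example_streams)
print(best)
```

With real data, the argument would be `video.streams.filter(only_audio=True)`, the same iterable the loop above walks over.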

In [6]:
# This method will transcribe an mp3 and return the transcription as a dictionary
def transcribe_mp3(mp3_file_path):
    return whisper_model.transcribe(str(mp3_file_path))
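
The dictionary that `whisper_model.transcribe` returns has a `"text"` key with the full transcript and a `"segments"` list, where each segment carries `"start"` and `"end"` times in seconds plus its own `"text"`. Below is a small sketch of turning those segments into timestamped lines. The `example_result` dictionary is made-up data in that shape (not real output), and both function names are my own.

```python
def format_timestamp(seconds):
    # Convert a number of seconds into an "MM:SS" string
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"

def segments_to_lines(transcription):
    # Build one "[start -> end] text" line per segment
    lines = []
    for seg in transcription.get("segments", []):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"[{start} -> {end}] {seg['text'].strip()}")
    return lines

# Made-up data shaped like a Whisper transcription result
example_result = {
    "text": " Hi, everyone. This is a review.",
    "segments": [
        {"start": 0.0, "end": 2.5, "text": " Hi, everyone."},
        {"start": 2.5, "end": 65.0, "text": " This is a review."},
    ],
}
lines = segments_to_lines(example_result)
for line in lines:
    print(line)
```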

In [7]:
# This method will return the "Watch URL" of a particular video when given the Video ID
def generate_watch_url(videoId):
    return f"https://www.youtube.com/watch?v={videoId}"
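
Going the other way, from a watch URL back to its video ID, is something the scraping loop needs as well. Here's a minimal sketch using the standard library's `urllib.parse`, which keeps working if the URL picks up extra query parameters such as `&t=30s`. `extract_video_id` is a name I'm making up here, and `generate_watch_url` is repeated only so the snippet runs on its own.

```python
from urllib.parse import urlparse, parse_qs

def generate_watch_url(videoId):
    return f"https://www.youtube.com/watch?v={videoId}"

def extract_video_id(watch_url):
    # Parse the query string and pull out the "v" parameter, if present
    query = parse_qs(urlparse(watch_url).query)
    values = query.get("v")
    return values[0] if values else None

video_id = extract_video_id("https://www.youtube.com/watch?v=N65_VUOVT6o&t=30s")
print(video_id)
```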

# Experimenting
First thing I want to try: download a **ton** of the NeedleDrop videos. When I say "download", I want to try and grab: 

- The audio associated with the video
- The details associated with the video
- The transcription of the video 

In theory, this ought to be enough to use for future experiments. I could tweak the format of how I'm saving each of these, but ultimately, this will be a "proof of concept" for the workflow of an on-demand Fantano video scraper. 

**NOTE:** It's unclear whether YouTube tolerates this kind of scraping, so I need to be mindful of how often I'm requesting information. [Make sure to rate limit.](https://github.com/pytube/pytube/issues/97) In the future, I'll *really* slow down the rate at which I'm scraping, since I don't want to get IP-banned. (I should also try using a VPN when doing this.)
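
One way to slow things down further is to wait longer after each consecutive failure. Here's a rough sketch of randomized exponential backoff. The function name, the `base` and `cap` values, and the jitter range are all my own placeholder choices, not anything YouTube or pytube documents.

```python
from random import uniform

def backoff_seconds(consecutive_failures, base=20, cap=600, jitter=5):
    # Double the wait for each consecutive failure, but never exceed the cap
    wait = min(cap, base * (2 ** consecutive_failures))
    # Add a little random jitter so requests aren't evenly spaced
    return wait + uniform(0, jitter)

# Show how the wait grows over the first few consecutive failures
for failures in range(4):
    print(failures, round(backoff_seconds(failures), 1))
```

In a scraping loop, this would be called as `sleep(backoff_seconds(n))`, with `n` reset to 0 after every successful request.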

In [8]:
# Grabbing all of the video URLs for theneedledrop
needledrop_video_urls = get_channel_video_urls("theneedledrop")

# Determine a minimum sleep time 
min_sleep_time = 20

# Iterate through all of the videos in Anthony Fantano's URL list and 
# download their details, audio, and transcription
for video_url in tqdm(needledrop_video_urls):
    
    # Determine where to save the contents of this loop by extracting the video ID 
    video_id = video_url.split("=")[-1]
    main_output_path = Path(f"data/theneedledrop_scraping/{video_id}/")
    main_output_path.mkdir(exist_ok=True, parents=True)
    mp3_output_path = Path(f"{main_output_path}/audio.mp3")
    details_output_path = Path(f"{main_output_path}/details.json")
    transcription_output_path = Path(f"{main_output_path}/transcription.json")
    
    # Don't scrape this file if these already exist
    if (mp3_output_path.exists() and details_output_path.exists() and transcription_output_path.exists()):
        continue
    
    # Try to grab the video details
    try:
        
        # Grab the video details, and then sleep a random amount of time 
        video_details = get_video_details(video_url)
        sleep(randint(min_sleep_time, min_sleep_time+5))
        
        # Save the video details
        with open(details_output_path, "w") as json_file:
            json.dump(video_details, json_file, indent=2)
        
    # If you run into some sort of error, skip this video
    # (catching Exception rather than everything, so Ctrl+C still stops the loop)
    except Exception:
        print(f"Ran into an error when scraping the video details for {video_url}.\nSleeping for a while and continuing...\n")
        sleep(randint(65, 75))
        continue
        
    # Try to download the video MP3
    try:
        
        # Download the MP3 audio for this video
        download_video_mp3(video_url, mp3_output_path)
        sleep(randint(min_sleep_time, min_sleep_time+5))
    
    # If you run into some sort of error, skip this video
    # (catching Exception rather than everything, so Ctrl+C still stops the loop)
    except Exception:
        print(f"Ran into an error when downloading the audio for {video_url}.\nSleeping for a while and continuing...\n")
        sleep(randint(65, 75))
        continue

    # Transcribe the audio
    transcription = transcribe_mp3(mp3_output_path)
    
    # Save the transcription
    with open(transcription_output_path, "w") as json_file:
        json.dump(transcription, json_file, indent=2)

 36%|███████████████████████████████████▎                                                               | 1419/3974 [08:08<08:31,  5.00it/s]

Ran into an error when scraping the video details for https://www.youtube.com/watch?v=N65_VUOVT6o.
Sleeping for a while and continuing...



 71%|██████████████████████████████████████████████████████████████████████▋                            | 2836/3974 [13:33<06:04,  3.12it/s]

Ran into an error when scraping the video details for https://www.youtube.com/watch?v=hwEA9vhpQDA.
Sleeping for a while and continuing...



 72%|███████████████████████████████████████████████████████████████████████▎                           | 2861/3974 [15:10<09:40,  1.92it/s]

Ran into an error when scraping the video details for https://www.youtube.com/watch?v=EC9-QwUEaoc.
Sleeping for a while and continuing...



 84%|███████████████████████████████████████████████████████████████████████████████████                | 3334/3974 [16:59<02:55,  3.64it/s]

Ran into an error when scraping the video details for https://www.youtube.com/watch?v=K-J6VwasuOg.
Sleeping for a while and continuing...



 91%|██████████████████████████████████████████████████████████████████████████████████████████▎        | 3625/3974 [18:22<01:42,  3.39it/s]

Ran into an error when scraping the video details for https://www.youtube.com/watch?v=HzUdHq42Z6A.
Sleeping for a while and continuing...



100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 3974/3974 [20:54<00:00,  3.17it/s]
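
Since the loop skips any video whose three output files already exist, the same check can be used to see how far the scrape has gotten. Here's a minimal sketch, assuming the `data/theneedledrop_scraping/<video_id>/` layout from the loop above. `summarize_scrape` and `EXPECTED_FILES` are hypothetical names of my own.

```python
from pathlib import Path

# The three files the scraping loop writes for each video
EXPECTED_FILES = ("audio.mp3", "details.json", "transcription.json")

def summarize_scrape(root):
    # Split video folders into those with every expected file and those missing some
    root = Path(root)
    complete, incomplete = [], []
    if not root.exists():
        return complete, incomplete
    for video_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        if all((video_dir / name).exists() for name in EXPECTED_FILES):
            complete.append(video_dir.name)
        else:
            incomplete.append(video_dir.name)
    return complete, incomplete

complete, incomplete = summarize_scrape("data/theneedledrop_scraping")
print(f"{len(complete)} complete, {len(incomplete)} incomplete")
```

The `incomplete` list is a reasonable starting point for a second, slower pass over the videos that errored out.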
