# **Uploading Previously Scraped Data**
When [I first made my prototype for Neural Needledrop](https://github.com/trevbook/neural-needle-drop-archive), I saved all of the `.mp3` files locally. Instead of re-scraping them, I can upload those files (along with their transcriptions and metadata) to my cloud storage, which gives the new pipeline a head start on its data.

# Setup
The cells below will set up the rest of the notebook.

I'll start by configuring the kernel: 

In [1]:
# Change the working directory 
%cd ..

# Enable the autoreload extension, which will automatically load in new code as it's written
%load_ext autoreload
%autoreload 2

# Set up some envvars
%env LOG_TO_CONSOLE=True
%env LOG_LEVEL=INFO
%env TQDM_ENABLED=True

d:\data\programming\neural-needledrop\pipeline
env: LOG_TO_CONSOLE=True
env: LOG_LEVEL=INFO
env: TQDM_ENABLED=True


Now I'll import some necessary modules:

In [52]:
# General import statements
import pandas as pd
import json
from pathlib import Path
import datetime
from tqdm import tqdm

# Importing custom modules
from utils.logging import get_logger
from utils.gbq import add_rows_to_table, delete_table, create_table
from utils.gcs import upload_files_to_bucket, list_bucket_objects
from utils.settings import GBQ_PROJECT_ID, GBQ_DATASET_ID

# Loading the Data
Below, I'm going to load in all of the data that I've got. 

In [9]:
# Define the folder that contains the data
archive_data_folder = Path(
    "D:/data/programming/neural-needle-drop-archive/data/theneedledrop_scraping"
)

# Iterate through each of the files in the folder and store some information
archive_data_df_records = []
for child in tqdm(list(archive_data_folder.iterdir())):
    # If the child is not a directory itself, continue
    if not child.is_dir():
        continue

    # Extract the video ID from the folder name
    video_id = child.name
    video_url = f"https://www.youtube.com/watch?v={video_id}"

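    # Identify any files within `child` that have the `.mp3` extension
    mp3_files = sorted(child.glob("*.mp3"))
    if not mp3_files:
        continue
    audio_file = mp3_files[0]

    # Rename the audio file to the video ID (skipped if that file already exists)
    target_audio_path = child / f"{video_id}.mp3"
    if not target_audio_path.exists():
        audio_file.rename(target_audio_path)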

    # Load in the transcription JSON file
    transcription_path = child / "transcription.json"
    if not transcription_path.exists():
        continue
    with open(transcription_path) as f:
        transcription_dict = json.load(f)
        segments = transcription_dict.get("segments", [])
    if len(segments) == 0:
        continue

    # Load in the details JSON file
    details_path = child / "details.json"
    if not details_path.exists():
        continue
    with open(details_path) as f:
        details_dict = json.load(f)

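    # Use the folder's `st_ctime` as the "created at" timestamp, converting it
    # from a float to a datetime object. On Windows this is the creation time;
    # on Unix it is the last metadata change instead.
    created_at = datetime.datetime.fromtimestamp(child.stat().st_ctime)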

    # Create a "transcription_data" list
    transcription_data = [
        {
            "url": video_url,
            "text": segment_info.get("text", None),
            "language": "en",
            "created_at": created_at,
            "segment_type": "small_segment",
            "segment_id": segment_info.get("id", None),
            "segment_seek": segment_info.get("seek", None),
            "segment_start": segment_info.get("start", None),
            "segment_end": segment_info.get("end", None),
        }
        for segment_info in segments
    ]

    # Store the data in a dataframe
    archive_data_df_records.append(
        {
            "video_id": video_id,
            "video_url": video_url,
            "created_at": created_at,
            "transcription_data": transcription_data,
            "metadata": details_dict,
            "audio_path": child / f"{video_id}.mp3",
        }
    )

# Make a dataframe from the records
archive_data_df = pd.DataFrame(archive_data_df_records)

100%|██████████| 3974/3974 [00:29<00:00, 134.85it/s]
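
To make the loop above concrete, here is a minimal sketch of how one transcript segment becomes one `transcription_data` row. The sample segment is made up for illustration; only the keys (`id`, `seek`, `start`, `end`, `text`) come from the loop itself, and `segment_to_row` is a hypothetical helper, not part of the pipeline.

```python
import datetime


def segment_to_row(video_url, segment_info, created_at):
    """Flatten one transcript segment into a row shaped like the loop's output."""
    return {
        "url": video_url,
        "text": segment_info.get("text"),
        "language": "en",
        "created_at": created_at,
        "segment_type": "small_segment",
        "segment_id": segment_info.get("id"),
        "segment_seek": segment_info.get("seek"),
        "segment_start": segment_info.get("start"),
        "segment_end": segment_info.get("end"),
    }


# Made-up segment, for illustration only
example_segment = {"id": 0, "seek": 0, "start": 0.0, "end": 4.2, "text": " Hi, everyone."}
example_row = segment_to_row(
    "https://www.youtube.com/watch?v=VIDEO_ID",  # placeholder URL
    example_segment,
    datetime.datetime(2023, 1, 1),
)
print(example_row["segment_start"], example_row["segment_end"])
```

Missing keys fall back to `None` rather than raising, which matches the `.get(..., None)` calls in the loop.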


# Uploading Audio to GCS
I'll start by uploading all of the `.mp3` files into GCS: 

In [None]:
# Determine which files are currently uploaded to the bucket
cur_files_uploaded = list_bucket_objects(
    bucket_name="neural-needledrop-audio", project_id=GBQ_PROJECT_ID
)
video_ids_uploaded = [file.split(".mp3")[0] for file in cur_files_uploaded]

# Determine which files have not yet been uploaded
videos_to_upload_df = archive_data_df[
    ~archive_data_df["video_id"].isin(video_ids_uploaded)
].copy()
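
The filtering step above boils down to a set difference between local video IDs and the object names already in the bucket. Here is a small sketch of that idea in plain Python. It assumes that `list_bucket_objects` returns object names such as `"<video_id>.mp3"`, which is what the cell above implies but which I haven't verified here; `pending_video_ids` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def pending_video_ids(local_video_ids, bucket_object_names):
    """Return the local video IDs that have no matching `.mp3` object in the bucket."""
    uploaded = {
        PurePosixPath(name).stem
        for name in bucket_object_names
        if name.endswith(".mp3")
    }
    # Preserve the original order of the local IDs
    return [video_id for video_id in local_video_ids if video_id not in uploaded]


# Placeholder IDs and object names, for illustration only
still_to_upload = pending_video_ids(
    ["id_one", "id_two", "id_three"],
    ["id_one.mp3", "id_three.mp3", "notes.txt"],
)
print(still_to_upload)
```

Using a set for the uploaded IDs keeps each membership check cheap, which matters a little more as the bucket grows.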

With the missing files identified, I'll upload them to GCS:

In [None]:
# Upload all of the data to GCS
upload_files_to_bucket(
    file_path_list=list(videos_to_upload_df["audio_path"]),
    bucket_name="neural-needledrop-audio",
    project_id=GBQ_PROJECT_ID,
    max_workers=1,
    show_progress=True,
    logger=get_logger("upload_files_to_bucket"),
)
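
The internals of `upload_files_to_bucket` aren't shown in this notebook, so the following is only a rough sketch of how a helper with a `max_workers` parameter might fan uploads out over a thread pool. The names `upload_all` and `upload_one` are hypothetical. In real use, `upload_one` could wrap a call like `bucket.blob(path.name).upload_from_filename(str(path))` from the `google-cloud-storage` client; here a fake uploader stands in so that the example runs without credentials.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def upload_all(file_paths, upload_one, max_workers=1):
    """Run `upload_one(path)` for every path and collect per-file outcomes.

    Returns a dict mapping file name to None on success, or to the
    exception that was raised for that file.
    """
    paths = [Path(p) for p in file_paths]
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(upload_one, path): path for path in paths}
        for future, path in futures.items():
            try:
                future.result()
                results[path.name] = None
            except Exception as exc:  # keep going so one failure doesn't stop the batch
                results[path.name] = exc
    return results


def fake_upload(path):
    """Stand-in uploader that fails for one specific file name."""
    if path.name == "bad.mp3":
        raise RuntimeError("simulated upload failure")


results = upload_all(["good.mp3", "bad.mp3"], fake_upload, max_workers=2)
failed = [name for name, error in results.items() if error is not None]
print(failed)
```

Collecting failures instead of raising on the first one makes it easy to re-run the upload for just the files that didn't make it.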

# Editing GBQ Tables
Next, I need to update the relevant GBQ tables so that they include the new data.

### Video Metadata
First, I'm going to update the `audio` table, which records where each video's audio file lives in GCS. I'll start by downloading the current table in its entirety.

In [13]:
# Query the entire audio table
current_audio_files_df = pd.read_gbq(
    f"""
    SELECT *
    FROM `{GBQ_PROJECT_ID}.{GBQ_DATASET_ID}.audio`
    """
)

Next, I'll work out which archived videos are missing from the table and shape their rows to match its columns.

In [33]:
# Figure out which files are not in the table
new_audio_files_df = (
    archive_data_df[
        ~archive_data_df["video_url"].isin(current_audio_files_df["video_url"].unique())
    ]
    .copy()[["video_url", "audio_path", "created_at"]]
    .rename(columns={"created_at": "scrape_date", "audio_path": "audio_gcr_uri"})
)

# Edit the columns to match the audio table
new_audio_files_df["audio_gcr_uri"] = new_audio_files_df["audio_gcr_uri"].apply(
    lambda x: f"gs://neural-needledrop-audio/{Path(x).name}"
)
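
The `lambda` above keeps only the file name from each local path and prefixes it with the bucket. Here is the same idea as a standalone sketch. `to_gcs_uri` is a hypothetical helper, and I'm using `PureWindowsPath` so that both `/` and `\` separators are handled regardless of the operating system running the code.

```python
from pathlib import PureWindowsPath


def to_gcs_uri(local_path, bucket_name="neural-needledrop-audio"):
    """Build a `gs://` URI from the file name of a local path."""
    file_name = PureWindowsPath(str(local_path)).name
    return f"gs://{bucket_name}/{file_name}"


# Placeholder path, for illustration only
example_uri = to_gcs_uri("D:/some/folder/VIDEO_ID/VIDEO_ID.mp3")
print(example_uri)
```

Because only the file name is kept, this relies on each `.mp3` having already been renamed to its video ID, as the loading loop does.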

With the new rows ready, I'll combine them with the existing rows so that the table can be replaced in full.

In [51]:
# Convert both to the same datetime type before concatenating
current_audio_files_df["scrape_date"] = current_audio_files_df["scrape_date"].astype(
    "datetime64[ns]"
)
new_audio_files_df["scrape_date"] = new_audio_files_df["scrape_date"].astype(
    "datetime64[ns]"
)

# Concatenate the existing rows and the new rows
merged_audio_files_df = pd.concat([current_audio_files_df, new_audio_files_df], axis=0)

# Drop any duplicates, keeping the last row seen for each video URL
merged_audio_files_df = merged_audio_files_df.drop_duplicates(
    subset=["video_url"], keep="last"
)
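
As a quick illustration of what `keep="last"` does here, the sketch below concatenates two tiny made-up frames and deduplicates them on `video_url`. The URLs and URIs are placeholders; only the column names come from the cells above.

```python
import pandas as pd

# Made-up "existing" and "new" rows that overlap on one video URL
current_rows = pd.DataFrame(
    {
        "video_url": ["url_a", "url_b"],
        "audio_gcr_uri": ["gs://bucket/a_old.mp3", "gs://bucket/b_old.mp3"],
    }
)
new_rows = pd.DataFrame(
    {
        "video_url": ["url_b", "url_c"],
        "audio_gcr_uri": ["gs://bucket/b_new.mp3", "gs://bucket/c_new.mp3"],
    }
)

# Rows from `new_rows` come last, so they win when a URL appears in both frames
merged_rows = pd.concat([current_rows, new_rows], axis=0).drop_duplicates(
    subset=["video_url"], keep="last"
)
uri_by_url = merged_rows.set_index("video_url")["audio_gcr_uri"].to_dict()
print(uri_by_url)
```

In this notebook the new rows were already filtered to URLs missing from the table, so the deduplication mostly acts as a safety net against duplicates within either frame.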

Finally, I'll upload the merged table, replacing the existing one:

In [53]:
# Replace the existing table with the merged data
merged_audio_files_df.to_gbq(
    destination_table=f"{GBQ_DATASET_ID}.audio",
    project_id=GBQ_PROJECT_ID,
    if_exists="replace",
)

100%|██████████| 1/1 [00:00<00:00, 844.43it/s]
