# Motivation
In another notebook (**TheNeedleDrop Scraping**), I defined a process to download data about each TheNeedleDrop review. Now, I want to create a couple of additional methods that create "enriched" versions of the files I've downloaded. 

Mostly, these "enriched" files will contain some important information about the video: description, publication date, album / artist information, etc. I'll try to parse the scores from the descriptions, and determine whether each video is a review or not. 

# Setup
The cells below will help to set up the rest of the notebook.

First, I'll change my working directory to the repo's root. 

In [None]:
# Changing the cwd to the repo root
%cd ..

Next, I'll import some important modules.

In [None]:
# Import statements
import pandas as pd
import json
from pathlib import Path
from tqdm import tqdm
import re
import plotly.express as px
from time import sleep
from Levenshtein import ratio
import math

# pytube-specific import statements
from pytube import YouTube
from pytube import Channel

# spotipy-specific import statements
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

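We're also going to set up the Spotify client. We'll use this later on! Since `SpotifyClientCredentials()` is called without arguments, spotipy reads the credentials from the `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` environment variables, so those need to be set first. 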

In [None]:
# Setting up the API client
spotify = spotipy.Spotify(client_credentials_manager=SpotifyClientCredentials())

# Method Development
I'm going to try and develop a number of "helper methods" that help to enrich the different files. 

### Detecting Reviews
The first method I want to write is a "review" detector. Fantano releases a ton of different videos that aren't reviews, like "top ____ of the year" lists.

The method below uses a video's title to determine if a video is a review or not.

In [None]:
# This method will attempt to determine whether a video is a review or not 
def detect_review(video_details_dict):
    
    # Return a bool indicating whether "album review" is a substring of the video title;
    # a details dict without a title (e.g. one that failed to load) counts as a non-review
    return "album review" in video_details_dict.get("title", "").lower()

### Classifying Videos
Instead of just detecting if a video is an album review, we can write a more sophisticated "video classifier". This will search the titles to determine what type of video a particular video is. 

In [None]:
# This method will attempt to determine what type of video a particular video is 
def classify_video_type(video_details_dict):
    
    # Transform the video title to a lowercase string (a missing title becomes an empty string)
    lowercase_video_title = video_details_dict.get("title", "").lower()
    
    # Try and determine what type of video this one is by parsing the title 
    if lowercase_video_title.find("album review") != -1 and lowercase_video_title.count("-") > 0:
        return "album_review"
    elif lowercase_video_title.find("ep review") != -1:
        return "ep_review"
    elif lowercase_video_title.find("track review") != -1:
        return "track_review"
    elif lowercase_video_title.find("mixtape review") != -1:
        return "mixtape_review"
    elif lowercase_video_title.find("yunoreview") != -1 or lowercase_video_title.find("y u no review") != -1:
        return "yunoreview" 
    elif lowercase_video_title.find("weekly track roundup") != -1 or lowercase_video_title.find("best & worst tracks") != -1:
        return "weekly_track_roundup"
    elif lowercase_video_title.find("tnd podcast #") != -1:
        return "tnd_podcast"
    elif lowercase_video_title.find("vinyl update") != -1:
        return "vinyl_update"
    else:
        return "misc"
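
The same title rules can also be written as a lookup table, which might be easier to extend as new video types turn up. This is only a sketch: `TITLE_RULES` and `classify_title` are hypothetical names, and the titles at the bottom are made up for illustration.

```python
# Hypothetical rule table: (video type, substrings that identify it), checked in order
TITLE_RULES = [
    ("album_review", ["album review"]),
    ("ep_review", ["ep review"]),
    ("track_review", ["track review"]),
    ("mixtape_review", ["mixtape review"]),
    ("yunoreview", ["yunoreview", "y u no review"]),
    ("weekly_track_roundup", ["weekly track roundup", "best & worst tracks"]),
    ("tnd_podcast", ["tnd podcast #"]),
    ("vinyl_update", ["vinyl update"]),
]

def classify_title(title):
    lowered = title.lower()
    for video_type, needles in TITLE_RULES:
        if not any(needle in lowered for needle in needles):
            continue
        # Album reviews also need a dash ("artist - album") somewhere in the title
        if video_type == "album_review" and "-" not in lowered:
            continue
        return video_type
    return "misc"

# Made-up titles, for illustration only
for title in ["Some Artist - Some Album ALBUM REVIEW",
              "Weekly Track Roundup: 1/1",
              "Top 50 Albums"]:
    print(f"{title!r} -> {classify_title(title)}")
```

Because the rules are checked in order, a title that matches several of them gets the first matching type, just like the `if` / `elif` chain.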

### Extracting Review Scores
The next method will attempt to use a regular expression to extract the review score from a video's description.

In [None]:
# This method will try to extract the review score from a video's description using a regex. 
# If successful, it'll return an int. If unsuccessful, it'll return None.
def extract_review_score(video_details_dict):
    
    # Try to parse the review score from the video's description
    try:
        video_description = video_details_dict["shortDescription"]

        # Capture only the 1-2 digits in front of "/10"; the lookbehind rejects longer numbers
        # (e.g. "110/10") and, unlike a leading [^0-9], also matches at the very start of the text
        search = re.findall(r'(?<![0-9])([0-9]{1,2})/10', video_description)

        # Use the last score mentioned in the description
        return int(search[-1])
    
    # Return None if the description is missing or doesn't contain a score
    except (KeyError, TypeError, IndexError):
        return None
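
To make the regex approach concrete, here's a minimal standalone sketch. `SCORE_PATTERN` and `last_score` are hypothetical names, and the descriptions are made up. The lookbehind keeps the tail of a longer number from being read as a score.

```python
import re

# 1-2 digits followed by "/10", not preceded by another digit
SCORE_PATTERN = re.compile(r"(?<![0-9])([0-9]{1,2})/10")

def last_score(description):
    # findall returns only the captured digits, so no extra splitting is needed
    matches = SCORE_PATTERN.findall(description)
    return int(matches[-1]) if matches else None

# Made-up descriptions, for illustration only
samples = [
    "Loved this one.\n\n8/10",
    "(7/10)",
    "Earlier I said 5/10, now it's a 10/10",
    "No score in this description",
]
for sample in samples:
    print(repr(sample), "->", last_score(sample))
```

One caveat of this pattern: a date such as `3/10/2020` also contains `3/10`, so it would be read as a score.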

### Extracting Album Information
For the videos that are identified as album reviews, we can extract the artist and album title from the video title. 

In [None]:
# This method will try and extract the album title and artist name from a review's title.
# It assumes titles look like "<artist> - <album title> ALBUM REVIEW"; both values come back lowercased.
def extract_album_info(video_details_dict):
    try: 
        # Keep only the part of the title in front of "album review"
        video_title = video_details_dict['title'].lower().split("album review")[0]

        # Split on the first dash only, so that any later dashes stay in the album title
        artist, album_title = [x.strip() for x in re.split(r'\s*-\s*', video_title, maxsplit=1)]
        return {"artist": artist, "album_title": album_title}

    # A missing title or a title without a dash ends up here
    except (KeyError, AttributeError, ValueError):
        return {"artist": None, "album_title": None}
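
Here's a small standalone sketch of the title splitting, mostly to show where it breaks down. `split_review_title` is a hypothetical helper and the titles are made up. Splitting on the first dash works for plain "artist - album" titles, but an artist name that itself contains a dash gets cut in the wrong place.

```python
import re

def split_review_title(title):
    # Keep only the part of the title in front of "album review"
    head = title.lower().split("album review")[0]

    # Split on the first dash, swallowing any whitespace around it
    parts = re.split(r"\s*-\s*", head, maxsplit=1)
    if len(parts) != 2:
        return {"artist": None, "album_title": None}
    artist, album_title = (part.strip() for part in parts)
    return {"artist": artist, "album_title": album_title}

# Made-up titles, for illustration only
print(split_review_title("Some Artist - Some Album ALBUM REVIEW"))
print(split_review_title("Some-Artist - Some Album ALBUM REVIEW"))
print(split_review_title("No Dash Here ALBUM REVIEW"))
```

The second title comes back with `some` as the artist, so results like that would need a manual check or a stricter separator such as `" - "`.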

### Searching Spotify for Album Info
I can use the Spotify API (via [spotipy](https://spotipy.readthedocs.io/en/2.22.0/#getting-started)) to search for information about each of the albums. 

In [None]:
def search_spotify_album_id(album_title, artist):
    
    # Search Spotify for a particular album
    try:
        search_str = f"{album_title} {artist}".lower()
        search_res = spotify.search(search_str, limit=1, type='album')
        sleep(1)

        # Extract some information from this Spotify search result
        album_id = search_res["albums"]["items"][0]['id']
        spotify_res_artist = search_res["albums"]["items"][0]["artists"][0]["name"]
        spotify_res_album_title = search_res["albums"]["items"][0]["name"]
        spotify_res_search_str = f"{spotify_res_album_title} {spotify_res_artist}".lower()

        # Determine how similar the result was to the search string 
        lev_sim = ratio(spotify_res_search_str.lower(), search_str)

        # If the result is above a particular similarity, we're going to return that information
        if (lev_sim >= 0.8):
            return album_id
        else:
            return None
        
    # If we run into an Exception, return None (unlike a bare except, this lets Ctrl+C interrupt a long scrape)
    except Exception:
        return None
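
The similarity check is what stops an unrelated top search result from being accepted. The sketch below shows the thresholding idea with the standard library's `difflib.SequenceMatcher` as a stand-in; it is not the same metric as `Levenshtein.ratio`, so the exact numbers will differ. `is_close_match` is a hypothetical helper and the strings are made up.

```python
from difflib import SequenceMatcher

def is_close_match(query, candidate, threshold=0.8):
    # Compare case-insensitively and accept only results at or above the threshold
    score = SequenceMatcher(None, query.lower(), candidate.lower()).ratio()
    return score >= threshold

query = "some album some artist"

# Made-up search results, for illustration only
for candidate in ["Some Album Some Artist",
                  "Some Album (Deluxe Edition) Some Artist",
                  "Completely Different"]:
    print(candidate, "->", is_close_match(query, candidate))
```

Note that the deluxe edition is rejected as well: the extra words in its title pull the similarity below 0.8, so a strict threshold can drop some correct albums.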

In addition to getting the album IDs for these albums, we'll want a method to extract the album information from Spotify. The cells below will create those methods. 

In [None]:
# This method will parse a Spotify album info dict
def parseAlbumInfo(res):
    
    # We're going to store the results in this albumInfo dict
    albumInfo = {}

    # Indicate which fields we're looking to grab
    fieldsToGrab = ["album_type", "external_ids", "external_urls", "genres", "href", "id", "images",
                    "label", "name", "popularity", "release_date", "release_date_precision",
                    "total_tracks", "type", "uri"]
    for field in fieldsToGrab:
        albumInfo[field] = res.get(field)

    # Parse the tracks dict a little more
    total_album_ms = 0
    if "tracks" in res:

        # Grab some fields about the track
        albumInfo["track_count"] = res["tracks"].get("total")

        # Iterate through each of the tracks in the album and grab some information about each
        trackList = []
        for track in res["tracks"]["items"]:
            curTrackInfo = {}
            track_fieldsToGrab = ["duration_ms", "id", "name"]
            for trackField in track_fieldsToGrab:
                curTrackInfo[f"track_{trackField}"] = track.get(trackField)

                if (trackField == "duration_ms" and track.get(trackField) is not None):
                    total_album_ms += track.get(trackField)

            trackList.append(curTrackInfo)
        albumInfo["tracks"] = trackList
        albumInfo["duration_ms"] = total_album_ms

    # Parse the artists dict a little more
    if "artists" in res:
        artistList = []
        for artist in res["artists"]:
            artistInfo = {}
            artist_fieldsToGrab = ["href", "id", "name", "type", "uri"]
            for artistField in artist_fieldsToGrab:
                artistInfo[artistField] = artist.get(artistField)
            artistList.append(artistInfo)
        albumInfo["artists"] = artistList

    # Parse the copyrights list a little more
    if "copyrights" in res:
        copyright_list = []
        for copy in res["copyrights"]:
            copyInfo = {}
            for key, val in copy.items():
                copyInfo[f"copyright_{key}"] = val
            copyright_list.append(copyInfo)
        albumInfo["copyright"] = copyright_list

    # Return the albumInfo dict
    return albumInfo

# This method will extract information for multiple albums
def albumInfoSpotify_multipleAlbums(albumID_list):

    # Return the Spotify results
    return spotify.albums(albumID_list)

# This will return a list of the raw / parsed album data for a set of albums
def spotify_albums(albumID_list):
    
    # Break this list up into chunks of 20 albums each
    master_chunk_results = []
    chunk_amt = math.ceil(len(albumID_list)/20)
    for cur_chunk in tqdm(list(range(chunk_amt))):
        list_chunk = albumID_list[(cur_chunk*20):((cur_chunk+1)*20)]
        
        # Parse the information for this chunk
        albumInfo_list = albumInfoSpotify_multipleAlbums(list_chunk)
        parsed_album_info = []
        for albumInfo in albumInfo_list["albums"]:
            parsedInfo = parseAlbumInfo(albumInfo)
            parsed_album_info.append({"raw": albumInfo, "parsed": parsedInfo})
        master_chunk_results += parsed_album_info
        sleep(5)
    
    # Return all of the albums we'd parsed
    return master_chunk_results
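
The chunking step on its own looks something like the sketch below. `chunked` is a hypothetical helper, and the IDs are placeholders rather than real Spotify IDs.

```python
def chunked(items, size):
    # Yield consecutive slices of at most `size` items each
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Placeholder IDs, for illustration only
fake_album_ids = [f"album_{i}" for i in range(45)]
chunk_sizes = [len(chunk) for chunk in chunked(fake_album_ids, 20)]
print(chunk_sizes)
```

Stepping `range` by the chunk size avoids the `math.ceil` bookkeeping, and an empty list simply produces no chunks.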

### Searching Spotify for Artist Info
I also want to search Spotify for the artist info. 

In [None]:
def search_spotify_artist_id(artist):
    
    # Search Spotify for a particular artist
    try:
        search_str = f"{artist}".lower()
        search_res = spotify.search(search_str, limit=1, type='artist')
        sleep(1)

        # Extract some information from this Spotify search result
        artist_id = search_res["artists"]["items"][0]["id"]
        spotify_res_artist = search_res["artists"]["items"][0]["name"]
        spotify_res_search_str = f"{spotify_res_artist}".lower()

        # Determine how similar the result was to the search string 
        lev_sim = ratio(spotify_res_search_str.lower(), search_str)

        # If the result is above a particular similarity, we're going to return that information
        if (lev_sim >= 0.8):
            return artist_id
        else:
            return None
        
    # If we run into an Exception, return None (unlike a bare except, this lets Ctrl+C interrupt a long scrape)
    except Exception:
        return None

def artistInfoSpotify_multipleArtists(artistID_list):

    # Return the Spotify results
    return spotify.artists(artistID_list)

def parseArtistInfo(res):
    # We're going to store the results in this artistInfo dict
    artistInfo = {}

    # Indicate which fields we're looking to grab
    fieldsToGrab = ["genres", "href", "id", "images", "name", "popularity", "type", "uri"]
    for field in fieldsToGrab:
        artistInfo[field] = res.get(field)

    # Parse the external_urls dict a little more
    if ("external_urls" in res):
        for service, url in res["external_urls"].items():
            artistInfo[f"{service}_url"] = url

    # Parse the followers dict a little more
    if ("followers" in res):
        for key, val in res["followers"].items():
            artistInfo[f"followers_{key}"] = val

    return artistInfo

# This will return raw/parsed artist data for a particular list of artists
def spotify_artists(artistID_list):
    
    # Break up the list into chunks
    master_chunk_results = []
    chunk_amt = math.ceil(len(artistID_list)/50)
    for cur_chunk in tqdm(list(range(chunk_amt))):
        list_chunk = artistID_list[(cur_chunk*50):((cur_chunk+1)*50)]
        
        # Parse the information for this chunk
        artistInfo_list = artistInfoSpotify_multipleArtists(list_chunk)
        parsed_artist_info = []
        for artistInfo in artistInfo_list["artists"]:
            parsedInfo = parseArtistInfo(artistInfo)
            parsed_artist_info.append({"raw": artistInfo, "parsed": parsedInfo})
        master_chunk_results += parsed_artist_info
        sleep(5)
    
    # Return all of the artists we'd parsed
    return master_chunk_results

# Testing Methods
Now that I've defined a number of the methods above, I want to determine some information about the different videos I'd downloaded. 

I'll start by creating a DataFrame of all of the videos I'd downloaded. 

In [None]:
# Create a DataFrame containing all of the data scraped for each of the videos
tnd_data_df_records = []
for child_dir in tqdm(list(Path("data/theneedledrop_scraping/").iterdir())):
    
    # Extract the video ID from the directory name
    cur_video_id = child_dir.name
    
    # Load in the details.json file
    try:
        with open(f"data/theneedledrop_scraping/{cur_video_id}/details.json", "r") as json_file:
            cur_details_dict = json.load(json_file)
    except (OSError, ValueError):  # missing / unreadable file, or invalid JSON
        cur_details_dict = {}
        
    # Load in the transcription.json file
    try:
        with open(f"data/theneedledrop_scraping/{cur_video_id}/transcription.json", "r") as json_file:
            cur_transcription_dict = json.load(json_file)
    except (OSError, ValueError):  # missing / unreadable file, or invalid JSON
        cur_transcription_dict = {}
        
    # Create a "record" for this video
    tnd_data_df_records.append({
        "video_id": cur_video_id,
        "details_dict": cur_details_dict,
        "transcription_dict": cur_transcription_dict
    })
    
# Now, we want to create a DataFrame from the tnd_data_df_records
tnd_data_df = pd.DataFrame.from_records(tnd_data_df_records)

With this collected, I'm going to start applying the different methods I've written to try and determine some info about the different videos. 

### Detecting Reviews
I've got two main questions for this section: 

1. What percentage of the videos are reviews? 
2. What are some of the titles that *aren't* reviews? (Knowing this could help with more "enrichment", since I'd be able to classify additional kinds of his videos.)

In [None]:
# Add a column to the tnd_data_df that indicates whether a video is a review or not 
tnd_data_df["is_review"] = tnd_data_df["details_dict"].apply(detect_review)

# Show value counts for the "is_review" column
tnd_data_df["is_review"].value_counts()

What are some of the titles of videos that aren't reviews? 

In [None]:
# Print a couple of non-review video titles
for row in tnd_data_df.query("is_review==False")["details_dict"].apply(
    lambda x: x['title'] if 'title' in x else None).head(10):
    print(row)

### Classifying Videos

In [None]:
tnd_data_df["video_type"] = tnd_data_df["details_dict"].apply(
    lambda x: classify_video_type(x))

tnd_data_df["video_type"].value_counts()

In [None]:
# Print the titles of up to 60 videos that were classified as "misc"
for row in tnd_data_df.query("video_type=='misc'")["details_dict"].apply(
    lambda x: x['title'] if 'title' in x else None).head(60):
    print(row)

### Extracting Review Scores
For those videos that *are* reviews: what's the score distribution? 

In [None]:
# Create a DataFrame consisting of exclusively album review videos
tnd_data_df_review_subset = tnd_data_df.query("video_type=='album_review'").copy()
tnd_data_df_review_subset["review_score"] = tnd_data_df_review_subset["details_dict"].apply(
    lambda x: extract_review_score(x))

# Count how many reviews received each score (naming the columns explicitly works across pandas versions)
score_value_count_df = tnd_data_df_review_subset["review_score"].value_counts().rename_axis(
    "score").reset_index(name="ct").sort_values("score", ascending=True).copy()

# Create a visualization showing the distribution of scores 
fig = px.histogram(tnd_data_df_review_subset.query("review_score<=10"), x="review_score")
fig.show()

### Extracting Album Information

In [None]:
tnd_data_df_review_subset["album_info_dict"] = tnd_data_df_review_subset["details_dict"].apply(
    lambda x: extract_album_info(x))

### Searching Spotify for Album Info
Next, I want to try and get the Spotify album ID for each of the albums within the TheNeedleDrop data. 

In [None]:
spotify_album_id_df_records = []
for row in tqdm(list(tnd_data_df_review_subset.itertuples())):
    spotify_album_id_df_records.append({
        "video_id": row.video_id,
        "spotify_album_id": search_spotify_album_id(row.album_info_dict["album_title"], 
                                                    row.album_info_dict["artist"])
    })
    sleep(2.5)
spotify_album_id_df = pd.DataFrame.from_records(spotify_album_id_df_records)
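Path("data/spotify_scraping").mkdir(exist_ok=True, parents=True)  # to_json won't create the folder itself
spotify_album_id_df.to_json("data/spotify_scraping/video_spotify_album_linkages.json", indent=2, orient="records")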
spotify_album_id_df = pd.read_json("data/spotify_scraping/video_spotify_album_linkages.json")

Now, quickly, I want to scrape Spotify for some album information. In theory, I'm going to do this in a much more sophisticated way once I have a full "pipeline" to collect this data, but for now, I'll just do it all at once.  

In [None]:
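# pd.notna covers both None and NaN, since missing IDs can come back as either after the JSON round trip
album_id_list = [x for x in list(spotify_album_id_df["spotify_album_id"]) if pd.notna(x)]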
album_info_results = spotify_albums(album_id_list)

Now, I'll save all of this data. 

In [None]:
spotify_album_data_path = Path("data/spotify_scraping/albums")
spotify_album_data_path.mkdir(exist_ok=True, parents=True)
for album_info_res in tqdm(album_info_results):
    cur_album_id = album_info_res["parsed"]["id"]
    with open(f"{spotify_album_data_path}/{cur_album_id}.json", "w") as json_file:
        json.dump(album_info_res["parsed"], json_file, indent=2, default=str)

As a part of saving all of this data, I'm going to save something of an "album index" in the spotify scraping folder. This will ensure that I've got an easy map of `[album title, artist name]` --> `[album id]`.

In [None]:
album_index_df_records = []
merged_album_df = spotify_album_id_df.merge(
    tnd_data_df_review_subset, how="right", on="video_id").copy()
for row in merged_album_df.itertuples():
    if (pd.notna(row.spotify_album_id)):
        album_search_str = f"{row.album_info_dict['album_title']} {row.album_info_dict['artist']}"
        album_index_df_records.append({
            "album_title": row.album_info_dict['album_title'],
            "artist": row.album_info_dict['artist'],
            "album_search_str": album_search_str,
            "spotify_album_id": row.spotify_album_id})
album_index_df = pd.DataFrame.from_records(album_index_df_records)
album_index_df.to_json("data/spotify_scraping/album_index.json", orient="records", indent=2)

### Searching Spotify for Artist Info
Another thing I'm interested in doing: searching Spotify for the artist info for each of the different artists. 

In [None]:
artist_list = list(set([x for x in list(tnd_data_df_review_subset["album_info_dict"].apply(lambda x: x['artist'] if 'artist' in x else None)) if x is not None]))
spotify_artist_result_dict = {}
for artist in tqdm(artist_list):
    spotify_artist_result_dict[artist] = search_spotify_artist_id(artist)
    sleep(3)

In [None]:
# Add an "artist name" column to the review subset DataFrame
tnd_data_df_review_subset["artist_name"] = tnd_data_df_review_subset["album_info_dict"].apply(lambda x: x['artist'] if 'artist' in x else None)

# Make a DataFrame out of the spotify_artist_result_dict
spotify_artist_result_df = pd.DataFrame.from_records([{"artist_name": key, "spotify_artist_id": val} for key, val in spotify_artist_result_dict.items()])

# Make a 'merged' version with the artist ID 
tnd_data_df_review_subset_merged = tnd_data_df_review_subset.merge(spotify_artist_result_df, how="left", on="artist_name").copy()

Next, we want to scrape Spotify for the artist information associated with each of the different artists. 

In [None]:
# pd.notna covers both None and NaN (the left merge can leave NaN for artists without a match)
all_artist_list = list(set([x for x in list(tnd_data_df_review_subset_merged["spotify_artist_id"]) if pd.notna(x)]))
all_artist_info = spotify_artists(all_artist_list)

Next, I'll save all of this data. 

In [None]:
spotify_artist_data_path = Path("data/spotify_scraping/artists")
spotify_artist_data_path.mkdir(exist_ok=True, parents=True)
for artist_info_res in tqdm(all_artist_info):
    cur_artist_id = artist_info_res["parsed"]["id"]
    with open(f"{spotify_artist_data_path}/{cur_artist_id}.json", "w") as json_file:
        json.dump(artist_info_res["parsed"], json_file, indent=2, default=str)

Finally, I'll make something of an "index" for this data. 

In [None]:
spotify_artist_index_df = tnd_data_df_review_subset_merged[
    ["artist_name", "spotify_artist_id"]].drop_duplicates()
spotify_artist_index_df = spotify_artist_index_df[spotify_artist_index_df["spotify_artist_id"].notna()]
spotify_artist_index_df.to_json("data/spotify_scraping/artist_index.json", orient="records", indent=2)

# Main Method
Now that I've developed a number of "enrichment" methods, I'm going to develop one main one: a method that enriches a single video when provided with the `details.json` dictionary. 

In [None]:
# Load in whichever Spotify indices have already been saved to disk
spotify_indices = {}
for index_type in ["album", "artist", "song"]:
    index_path = Path(f"data/spotify_scraping/{index_type}_index.json")
    if (index_path.exists()):
        with open(index_path, "r") as json_file:
            spotify_indices[index_type] = pd.DataFrame(json.load(json_file))
            
# Loading in the video_spotify_linkages
video_spotify_linkages = {}
for linkage_type in ["album", "artist", "song"]:
    linkage_path = Path(f"data/spotify_scraping/video_spotify_{linkage_type}_linkages.json")
    if (linkage_path.exists()):
        with open(linkage_path, "r") as json_file:
            video_spotify_linkages[linkage_type] = pd.DataFrame(json.load(json_file))

# This method will enrich a video's details_dict
def enrich_video_details(input_details_dict):
    
    # Set up the enriched_details_dict
    enriched_details_dict = {}

    # ===================================
    # DUPLICATING THE DETAILS_DICT
    # ===================================

    # Add all of the keys from the input_details_dict to this new enriched_details_dict
    for key, val in input_details_dict.items():
        enriched_details_dict[key] = val

    # ===================================
    # INFERRING VIDEO INFORMATION
    # ===================================

    # First, we'll determine what kind of video this is
    try: 
        inferred_video_type = classify_video_type(input_details_dict)
        enriched_details_dict["inferred_video_type"] = inferred_video_type
    except:
        enriched_details_dict["inferred_video_type"] = None

    # If the video is an album review, we're going to try and detect the score
    try:
        if (inferred_video_type == "album_review"):
            inferred_review_score = extract_review_score(input_details_dict)
            enriched_details_dict["inferred_review_score"] = inferred_review_score
    except:
        enriched_details_dict["inferred_review_score"] = None

    # ===================================
    # CREATING SPOTIFY LINKAGES
    # ===================================

    # Setting up the Spotify linkages dictionary
    spotify_linkages_dict = {'album': [], 'artist': [], 'song': []}

    # If the video is an album review, we're going to check if we've scraped the album data
    if (enriched_details_dict["inferred_video_type"] == "album_review"):

        # Check if the album has had information scraped
        # NOTE: `@cur_video_id` resolves to the notebook-level `cur_video_id` variable, so this
        # lookup only works when the method is called from a loop that sets that variable
        if ('album' in video_spotify_linkages):
            video_linkage_df_query = video_spotify_linkages['album'].query("video_id==@cur_video_id")
            if (len(video_linkage_df_query) > 0):
                spotify_linkages_dict['album'].append({"review_album": video_linkage_df_query.iloc[0]['spotify_album_id']})

        # TODO - Instead of just checking the artist_index.json, we ought to try and 
        # actually *scrape* Spotify for data (if this is an unseen artist / album / song etc.)

    # Now that we're finished adding the Spotify linkages, we can store them in the enriched dict
    enriched_details_dict["spotify_linkages"] = spotify_linkages_dict
    
    # Return the enriched details
    return enriched_details_dict
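
One possible refinement, sketched here rather than applied: the linkage lookup could take the video ID as an explicit argument, which would make the method usable outside of a loop. `find_album_linkage` is a hypothetical helper that works on plain records shaped like the rows of `video_spotify_album_linkages.json`; the IDs below are placeholders.

```python
def find_album_linkage(linkage_records, video_id):
    # Return the first non-empty Spotify album ID recorded for this video, if any
    for record in linkage_records:
        if record.get("video_id") == video_id and record.get("spotify_album_id") is not None:
            return record["spotify_album_id"]
    return None

# Placeholder records, for illustration only
example_records = [
    {"video_id": "vid_a", "spotify_album_id": "album_1"},
    {"video_id": "vid_b", "spotify_album_id": None},
]
print(find_album_linkage(example_records, "vid_a"))
print(find_album_linkage(example_records, "vid_b"))
print(find_album_linkage(example_records, "vid_c"))
```

Videos without a match, and videos whose search came back empty, both return `None`, so the caller can skip adding a linkage in either case.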

### Enriching All Videos
Now that I've written the details enrichment method, I should be able to enrich all of the videos. 

In [None]:
# Iterate through all of the video folders
for video_folder_path in tqdm(list(Path("data/theneedledrop_scraping/").iterdir())):
    if (video_folder_path.is_dir()):
        cur_video_id = video_folder_path.name
        
        # Load in the details dictionary 
        cur_video_folder_path = Path(f"data/theneedledrop_scraping/{cur_video_id}/")
        cur_video_details_path = Path(f"{cur_video_folder_path}/details.json")
        if (cur_video_details_path.exists()):
            with open(cur_video_details_path, "r") as json_file:
                cur_video_details_dict = json.load(json_file)
        else:
            cur_video_details_dict = None
        
        # If the details dictionary was successfully loaded, we can enrich it and save the result
        if (cur_video_details_dict is not None):
            
            # Enrich the details dictionary
            enriched_details_dict = enrich_video_details(cur_video_details_dict)
            
            # Now, save this enriched details dictionary
            with open(f"{cur_video_folder_path}/enriched_details.json", "w") as json_file:
                json.dump(enriched_details_dict, json_file, indent=2)