In [127]:
import json
import os
import re
import sys
import requests
import pandas as pd
from bs4 import BeautifulSoup
# Add the project's root directory to the Python path.
# This helps the notebook find custom modules which are located relative to the project root.
package_root_dir = os.path.join(os.getcwd(), "..", "..")
data_dir = os.path.join(package_root_dir, "data")
raw_data_dir = os.path.join(package_root_dir, "data", "raw")
aoty_data_file_path = os.path.join(package_root_dir, "data", "raw", "aoty_htmls")
processed_data_dir = os.path.join(package_root_dir, "data", "processed")
sys.path.append(package_root_dir)

### <u>UPDATE</u>: This notebook is now archived. [https://melondy.com/](https://melondy.com/) exists and I was unaware of its existence at the time of writing this code :( . At least I learned about scraping in the process. It also makes it substantially easier to run a timed script to get all of the new reviews to create consistent datasets in the future.

# ALL Needledrop Scores

It has been a longstanding task on the internet to document all of Anthony Fantano's scores on albums. It's a Herculean task given that he has been around for over a decade offering a surprisingly consistent rating system over the years. 

A few sources have done an incredible job! Notably, the [Album of the Year](https://www.albumoftheyear.org/ratings/57-the-needle-drop-highest-rated/all) page is STILL maintained by Fantano himself. Unfortunately, he just multiplies the raw number by 10 there and calls it a day, so none of his "light, decent, strong" qualifiers make it into the final scores. The [Genius list](https://genius.com/Anthony-fantano-every-anthony-fantano-review-annotated) attempts to rectify this, but with only so many contributors, maintenance seems to have lost steam at the tail end of 2024.

What we are attempting here will hopefully let us record Fantano's reviews more consistently and in real time. Admittedly, the MOST consistent approach would be to run a server that constantly checks whether Anthony has uploaded a new video, scrapes the transcript from YouTube if there is one, and extracts the rating from it (if there is one). The issue is that YouTube transcripts are not 100% reliable. So, as long as Anthony continues to maintain [his own website](https://theneedledrop.com/album-reviews/), we can instead scrape the website for the script he uses in the video!

One might think that "Strong 7 to Light 8" can't be stored numerically, but I found a [very creative system](https://pastebin.com/Z6QnbTVn) by yet [another attempted maintenance of all Fantano scores](https://www.albumoftheyear.org/user/jonikles/list/53273/anthony-fantanos-all-scores/) that maps every possible score to a number from 0 to 100. As it turns out, there are 91 categories and 101 values, so the system omits any value with a 9 in the ones digit. The system treats a move from "Strong K to Light K+1" to "Light K+1" as a more definitive stance from Fantano on the final number he lands on, and (at least from a vibe/sentiment perspective) I can get behind this labelling of scores.
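The counting can be checked in a few lines. This is only a sanity check on the arithmetic; it does not reproduce which label maps to which number (that table lives in the linked pastebin), and the variable name `usable_values` is just for illustration.

```python
# Integers 0-100 whose ones digit is not 9: the values the mapping can use,
# going by the description above.
usable_values = [v for v in range(101) if v % 10 != 9]

print(len(range(101)))     # 101 candidate values
print(len(usable_values))  # 91, one per rating category
print(usable_values[:12])  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12]
```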

## Fantano Website Scraping

The `scraper.fantano_website_scraper` Python module runs a script that takes a configuration of URLs to match and scrapes all album reviews off of Fantano's website. The idea is to plug the final paragraph(s) of the script on each webpage into an OpenAI prompt that determines what Fantano rated the album under the creative system described above.

In [4]:
with open(os.path.join(raw_data_dir, "fantano_website_data.json"), "r", encoding="utf-8") as f:
    scraped_data = json.load(f)

Let's examine the first review in the data!

In [5]:
scraped_data[1] # scraped_data[0] is the home page.

{'url': 'https://theneedledrop.com//album-reviews/earl-sweatshirt-live-laugh-love-album-review/',
 'html': '\nHi, everyone. Livethony Laughtano here, the internet\'s busiest music nerd. It\'s time for a review of this new Earl Sweatshirt album, Live Laugh Love. Here we have the fifth full-length commercial LP from rapper and songwriter Mr. Earl Sweatshirt, a man who is the closest thing I would say that we have right now to a modern hip hop folk hero. Because sure, you could categorize Earl as one of many underground rappers operating out there today, but that would not be his full story. Even when rewinding to his beginnings in the music collective, Odd Future, you still wouldn\'t get the full picture. Because Earl was one of many names to come out of that greater Odd Future universe. And yet he hasn\'t evolved to become the mainstream-ready conceptualist that Tyler, the Creator is, nor did he reset the R&B landscape with just a couple of records just like Frank Ocean did. So yeah, wh

That's an incredibly long piece of text to plug into OpenAI just to extract a single score. Since the verdict comes at the very end of the script, let's keep only the final 100 characters.

In [6]:
scraped_data[1]['html'][-100:]

" which is why I'm feeling a strong 7 to a light 8 on it. Anthony Fantano, Earl Sweatshirt, Forever.\n"

Seems like we stripped away a lot of irrelevant text and even the worst AI completion should be able to map this "Strong 7 to Light 8". But perhaps we got lucky! Let's try the same approach on the first 15 reviews in the list.

In [7]:
for i in range(1, 16):
    print({scraped_data[i]['url']: scraped_data[i]['html'][-100:]})


{'https://theneedledrop.com//album-reviews/earl-sweatshirt-live-laugh-love-album-review/': " which is why I'm feeling a strong 7 to a light 8 on it. Anthony Fantano, Earl Sweatshirt, Forever.\n"}
{'https://theneedledrop.com//album-reviews/': ''}
{'https://theneedledrop.com//album-reviews/deftones-private-music-album-review/': "hich is why I'm feeling about a decent to strong 7 on this one. Anthony Fantano, Deftones, Forever.\n"}
{'https://theneedledrop.com//album-reviews/laufey-a-matter-of-time-album-review/': ", which is why I'm feeling a strong 5 to a light 6 on this album. Anthony Fantano, Laufey, forever.\n"}
{'https://theneedledrop.com//album-reviews/mgks-lost-americana-is-not-good/': " so shallow, and are just not captivating at all.Unfortunately, this new mgk record, it's not good.\n"}
{'https://theneedledrop.com//album-reviews/ninajirachi-i-love-my-computer-album-review/': "ene, which is why I'm feeling about a light 8 on this album. Anthony Fantano, Ninajirachi. Forever.\n"}
{

On average, the endings of the album reviews look similar, but we have some unnecessary URLs and some posts that don't actually contain a score. Luckily, we can fix this, because any actual review with a score seems to have a URL ending in "album-review/".

In [8]:
from utils.data_utils import process_scraped_data
processed_data = process_scraped_data(scraped_data)
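The source of `process_scraped_data` isn't shown in this notebook, so here is a minimal sketch of the kind of filter described above. The function name `keep_scored_reviews`, the `suffix` parameter, and the extra "non-empty text" condition are all assumptions for illustration; the real helper may do more or less than this.

```python
def keep_scored_reviews(pages, suffix="album-review/"):
    """Keep scraped pages whose URL ends with `suffix` and whose text is non-empty.

    Hypothetical stand-in for the real helper; the non-empty check is an assumption.
    """
    return [
        page for page in pages
        if page["url"].endswith(suffix) and page["html"].strip()
    ]


# Shortened sample records shaped like the entries in scraped_data.
sample_pages = [
    {"url": "https://theneedledrop.com//album-reviews/", "html": ""},
    {"url": "https://theneedledrop.com//album-reviews/mgks-lost-americana-is-not-good/",
     "html": "...it's not good.\n"},
    {"url": "https://theneedledrop.com//album-reviews/dijon-baby-album-review/",
     "html": "...a strong 7 to a light 8 on it.\n"},
]

kept = keep_scored_reviews(sample_pages)
print([page["url"] for page in kept])
```

Only the Dijon page survives here: the index page has no text, and the mgk post does not end in "album-review/".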

In [9]:
for i in range(1, 16):
    print({processed_data[i]['url']: processed_data[i]['html'][-100:]})


{'https://theneedledrop.com//album-reviews/deftones-private-music-album-review/': "hich is why I'm feeling about a decent to strong 7 on this one. Anthony Fantano, Deftones, Forever.\n"}
{'https://theneedledrop.com//album-reviews/laufey-a-matter-of-time-album-review/': ", which is why I'm feeling a strong 5 to a light 6 on this album. Anthony Fantano, Laufey, forever.\n"}
{'https://theneedledrop.com//album-reviews/ninajirachi-i-love-my-computer-album-review/': "ene, which is why I'm feeling about a light 8 on this album. Anthony Fantano, Ninajirachi. Forever.\n"}
{'https://theneedledrop.com//album-reviews/chance-the-rapper-star-line-album-review/': "s why I'm feeling a strong 6 to a light 7 on this one. Anthony Fantano, Chance the Rapper, forever.\n"}
{'https://theneedledrop.com//album-reviews/dijon-baby-album-review/': "is debut, which is why I'm feeling a strong 7 to a light 8 on it.  Anthony Fantano, Dijon, Forever.\n"}
{'https://theneedledrop.com//album-reviews/joey-valence-brae-hy

In [10]:
from agents.fantano_website_db_maker import extract_fantano_rating

In [11]:
extracted_fantano_ratings = {album_review['url']: extract_fantano_rating(album_review['html'][-250:]) for album_review in processed_data}

In [15]:
for i, (url, rating) in enumerate(extracted_fantano_ratings.items()):
    print(f"URL: {url}\nRating: {rating}\n")
    if i >= 4:
        break

URL: https://theneedledrop.com//album-reviews/earl-sweatshirt-live-laugh-love-album-review/
Rating: Strong 7 to Light 8 (Pick Strong 7)

URL: https://theneedledrop.com//album-reviews/deftones-private-music-album-review/
Rating: Decent to Strong 7 (Pick Decent 7)

URL: https://theneedledrop.com//album-reviews/laufey-a-matter-of-time-album-review/
Rating: "Strong 5 to Light 6 (Pick Strong 5)"

URL: https://theneedledrop.com//album-reviews/ninajirachi-i-love-my-computer-album-review/
Rating: "Light 8"

URL: https://theneedledrop.com//album-reviews/chance-the-rapper-star-line-album-review/
Rating: Strong 6 to Light 7 (Pick Strong 6)



Not bad!!! It seems to have done a decent job! The problem is that once you get to page 200+, which reaches album reviews from the early 2010s, the script isn't included in the body of the HTML, so those will likely be riddled with N/A ratings. We can fill in some of the missing ones with the Genius ratings. For now, let's just make sure we save the data.

In [30]:
# AI sometimes includes the quotes in the rating string.
# Iterate through the extracted ratings and remove any leading/trailing double quotes from the rating strings.
for url, rating in extracted_fantano_ratings.items():
    # Check if the rating is a string before attempting to strip quotes.
    if isinstance(rating, str):
        extracted_fantano_ratings[url] = rating.strip('"')

fantano_website_data = pd.DataFrame(list(extracted_fantano_ratings.items()), columns=['url', 'rating'])
fantano_website_data.to_csv(os.path.join(processed_data_dir, "fantano_website_data.csv"), index=False)
fantano_website_data.head()

|   | url | rating |
|---|-----|--------|
| 0 | https://theneedledrop.com//album-reviews/earl-... | Strong 7 to Light 8 (Pick Strong 7) |
| 1 | https://theneedledrop.com//album-reviews/defto... | Decent to Strong 7 (Pick Decent 7) |
| 2 | https://theneedledrop.com//album-reviews/laufe... | Strong 5 to Light 6 (Pick Strong 5) |
| 3 | https://theneedledrop.com//album-reviews/ninaj... | Light 8 |
| 4 | https://theneedledrop.com//album-reviews/chanc... | Strong 6 to Light 7 (Pick Strong 6) |


In [117]:
fantano_website_data.shape

(281, 2)

It seems that only recently did Anthony start using URLs that end in "album-review" and including his entire script in the page body. So this approach, while a great exercise in scraping and extracting labels with AI, looks like a dead end for full coverage. We likely need YouTube transcripts (perhaps in another notebook).
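For the recent pages where the script is present, the sign-off is formulaic enough that a regular expression could act as a cheap cross-check on the model's labels. This is a swapped-in technique, not what `extract_fantano_rating` does; the function name `regex_rating` and both patterns are assumptions based only on the handful of endings printed above, so other phrasings will slip through.

```python
import re

# "strong 7 to a light 8" or "decent to strong 7"
RANGE_RE = re.compile(
    r"\b(light|decent|strong)(?: (\d{1,2}))? to (?:a )?(light|decent|strong) (\d{1,2})\b",
    re.IGNORECASE,
)
# "light 8"
SINGLE_RE = re.compile(r"\b(light|decent|strong) (\d{1,2})\b", re.IGNORECASE)


def regex_rating(tail):
    """Return a label such as 'Strong 7 to Light 8', or None if nothing matches."""
    match = RANGE_RE.search(tail)
    if match:
        low_q, low_n, high_q, high_n = match.groups()
        low_n = low_n or high_n  # "decent to strong 7" shares a single number
        if low_n == high_n:
            return f"{low_q.title()} to {high_q.title()} {high_n}"
        return f"{low_q.title()} {low_n} to {high_q.title()} {high_n}"
    match = SINGLE_RE.search(tail)
    if match:
        return f"{match.group(1).title()} {match.group(2)}"
    return None


print(regex_rating("which is why I'm feeling a strong 7 to a light 8 on it."))
print(regex_rating("I'm feeling about a decent to strong 7 on this one."))
print(regex_rating("Unfortunately, this new mgk record, it's not good."))
```

Rows where the regex and the model disagree (or where the regex finds nothing) would be the ones worth checking by hand.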

## Genius Scraping

We mentioned earlier that there is a Genius list that is heavily maintained, with some decent documentation. We can scrape this list to create another dataset of albums and their ratings. One way to get broad coverage of Fantano's ratings is to join our results from Fantano's website with the Genius records.

In [28]:
r = requests.get("https://genius.com/Anthony-fantano-every-anthony-fantano-review-annotated")

In [29]:
rating_parser = BeautifulSoup(r.text, "html.parser")

In [38]:
year_paragraphs = rating_parser.find_all("p")
year_paragraphs

[<p class="SongPage__HeaderSpace-sc-82f56136-1 jANtLH"></p>,
 <p>• Gorillaz - <a href="https://genius.com/albums/Gorillaz/Plastic-beach" rel="noopener">"Plastic Beach"</a> (STRONG 7)<br/>• The Knife - <a href="https://genius.com/albums/The-knife/Tomorrow-in-a-year" rel="noopener">"Tomorrow, in a Year"</a> (LIGHT 3)<br/>• High on Fire - <a href="https://genius.com/albums/High-on-fire/Snakes-for-the-divine" rel="noopener">"Snakes for the Divine"</a> (DECENT 8)<br/>• <a class="ReferentFragment-desktop__ClickTarget-sc-380d78dd-0 eIfoVs" data-ignore-on-click-outside="true" href="/24536286/Anthony-fantano-every-anthony-fantano-review/Liars"><span class="ReferentFragment-desktop__Highlight-sc-380d78dd-1 fIkrDi">Liars</span></a><span style="position:absolute;opacity:0;width:0;height:0;pointer-events:none;z-index:-1" tabindex="0"></span><span><span style="position:absolute;opacity:0;width:0;height:0;pointer-events:none;z-index:-1" tabindex="0"></span><span style="position:absolute;opacity:0;wid

In [110]:
genius_dict = {"artist": [], "album": [], "rating": []}

# Each <p> holds one block of reviews; entries inside it are separated by bullets.
for year in year_paragraphs:
    albums = str(year).split("• ")[1:]
    for album in albums:
        try:
            # Entries look like: Artist - <a ...>"Album"</a> (RATING)
            artist_name, other = album.split(' - ', 1)
            try:
                # Annotated artist names are wrapped in <a><span>...</span></a>; unwrap them.
                artist_name_extractor = BeautifulSoup(artist_name, "html.parser")
                artist_name = artist_name_extractor.find("a").find("span").text
            except AttributeError:
                # Plain-text artist name: keep it as-is.
                pass
            album_name = re.search(r'>".+"<', other)
            # The rating is the last parenthesised group in the entry.
            rating = re.findall(r'\(.+?\)', other)[-1]
            artist_name, album_name, rating = artist_name.strip(), album_name.group(0).strip()[1:-1], rating[1:-1]
            genius_dict["artist"].append(artist_name)
            genius_dict["album"].append(album_name)
            genius_dict["rating"].append(rating)
        except AttributeError:
            # album_name is None when the title is unlinked or not in double quotes.
            print("Error occurred on: ", artist_name, album_name, rating)
            continue

Error occurred on:  Havok None (STRONG 8-LIGHT 9)
Error occurred on:  H-SIK None (STRONG 7)
Error occurred on:  Baroness None (DECENT-STRONG 7)
Error occurred on:  Gnaw Their Tongues None (DECENT-STRONG 7)
Error occurred on:  Guardian Alien None (DECENT-STRONG 6)
Error occurred on:  Frost Children None (DECENT-STRONG 7)


The maintainers had quite some fun putting some of these together, since not all of the scores are in line with the rating system Fantano uses (e.g. The Flaming Tips).

In [111]:
genius_df = pd.DataFrame(genius_dict)
genius_df.head()

|   | artist | album | rating |
|---|--------|-------|--------|
| 0 | Gorillaz | "Plastic Beach" | STRONG 7 |
| 1 | The Knife | "Tomorrow, in a Year" | LIGHT 3 |
| 2 | High on Fire | "Snakes for the Divine" | DECENT 8 |
| 3 | Liars | "Sisterworld" | STRONG 6 |
| 4 | Broken Bells | "Broken Bells" | LIGHT 5 |


Looks like a decent scrape! We do have a few poorly matched albums as seen by the print-outs from the except statement. We can correct that manually. These usually boil down to a link being missing or a single quote instead of a double quote used on the Genius list.

In [115]:
correct_albums = {
    "artist": ["Havok","H-SIK","Baroness","Gnaw Their Tongues","Guardian Alien","Frost Children"],
    "album": ['"Time Is Up"','"Cocody"','"Yellow & Green"','"Eschatological Scatology"','"See the World Given to a One Love Entity"','"SPEED RUN"'],
    "rating": ["STRONG 8-LIGHT 9","STRONG 7","DECENT-STRONG 7","DECENT-STRONG 7","DECENT-STRONG 6","DECENT-STRONG 7"]
}

genius_df = pd.concat([genius_df, pd.DataFrame(correct_albums)], ignore_index=True)
genius_df[-6:]

|      | artist | album | rating |
|------|--------|-------|--------|
| 1055 | Havok | "Time Is Up" | STRONG 8-LIGHT 9 |
| 1056 | H-SIK | "Cocody" | STRONG 7 |
| 1057 | Baroness | "Yellow & Green" | DECENT-STRONG 7 |
| 1058 | Gnaw Their Tongues | "Eschatological Scatology" | DECENT-STRONG 7 |
| 1059 | Guardian Alien | "See the World Given to a One Love Entity" | DECENT-STRONG 6 |
| 1060 | Frost Children | "SPEED RUN" | DECENT-STRONG 7 |


In [118]:
genius_df.shape

(1061, 3)

Looks great! We can quickly check the unique rating categories to see if anything is incorrect.

In [116]:
genius_df["rating"].unique()

array(['STRONG 7', 'LIGHT 3', 'DECENT 8', 'STRONG 6', 'LIGHT 5',
       'STRONG 8', 'LIGHT 9', 'STRONG 3', 'STRONG 7-LIGHT 8', 'LIGHT 6',
       'STRONG 5-LIGHT 6', 'DECENT-STRONG 6', 'LIGHT 7',
       'STRONG 4-LIGHT 5', 'STRONG 6-LIGHT 7', 'DECENT-STRONG 5',
       'LIGHT-DECENT 4', 'DECENT 7', 'LIGHT 8', 'STRONG 5', 'DECENT 9',
       'DECENT-STRONG 7', 'STRONG 4', 'DECENT 5', '7', 'LIGHT 4',
       'DECENT-STRONG 8', 'LIGHT 2', 'STRONG 8-LIGHT 9', 'CHILL 8',
       'LIGHT-DECENT 5', 'NO SCORE', 'LIGHT-DECENT 6', 'STRONG 3-LIGHT 4',
       'LIGHT-DECENT 7', 'DECENT 6', 'FACE-PUNCHING, HEAD-CRUSHING 9',
       'ON THE FENCE 5', 'STRONG 2', 'LIGHT-DECENT 8', 'BEEFY 8',
       'ALRIGHT 5', 'STRONG 5-WEAK 6', 'HEYCOOLCHECKTHISOUT 7',
       'DECENT-STRONG 4',
       "STRONG THIS IS A PRETTY FREAKIN COOL SOUNDTRACK YOU SHOULD CHECK IT OUT IF YOU'RE INTO SCORES I'M NOT REALLY BUT I HAVE TO RECOGNIZE IMPRESSIVE PRODUCTION AND INSTRUMENTATION WHEN I HEAR IT YEAH IT'S PRETTY COOL THING YOU S

There are clearly a few "meme" ratings that we can clean up to fit the scoring system described above.
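A first step toward that clean-up could be to flag which labels already have a standard shape and which need a manual decision. The sketch below is an assumption-heavy starting point: `STANDARD_RE` and `is_standard_rating` are hypothetical names, and the three accepted shapes are inferred only from the unique values printed above.

```python
import re

# Shapes treated as "standard", inferred from the unique values above.
STANDARD_RE = re.compile(
    r"(?:LIGHT|DECENT|STRONG) (?:10|\d)"            # e.g. STRONG 7
    r"|(?:LIGHT-DECENT|DECENT-STRONG) (?:10|\d)"    # e.g. DECENT-STRONG 6
    r"|STRONG (?:10|\d)-LIGHT (?:10|\d)"            # e.g. STRONG 7-LIGHT 8
)


def is_standard_rating(rating):
    """True when the whole label matches one of the standard shapes."""
    return STANDARD_RE.fullmatch(rating) is not None


sample_ratings = [
    "STRONG 7", "DECENT-STRONG 6", "STRONG 7-LIGHT 8",
    "7", "CHILL 8", "NO SCORE", "STRONG 5-WEAK 6",
]
needs_review = [r for r in sample_ratings if not is_standard_rating(r)]
print(needs_review)
```

On the real data the same check could be applied with `genius_df["rating"].map(is_standard_rating)`, leaving a short list of meme labels to remap by hand.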

(TO FINISH OR CLEAN UP LATER)

## AOTY Scraping

Anthony has actually contributed heavily to the reviews on the Album of the Year review website. This is a great way to get some higher coverage but this direction comes with a few setbacks. 

1) Cloudflare on this website blocks automated scraping, even though the robots.txt file does not explicitly forbid access to these pages. As a result, we need to manually save the HTML in order to extract the ratings.

2) The "NOT GOOD" reviews (and possibly the ones with meme scores) are not included in these reviews. 

3) The added detail of "light", "decent", or "strong" is not present in these reviews.

In [190]:
from bs4 import BeautifulSoup
import os

aoty_website_data = {
    "artist": [],
    "album": [],
    "album_image": [],
    "rating": []
}

NUM_REVIEW_PAGES = 54

for i in range(1, NUM_REVIEW_PAGES + 1):
    page = f"{i}.html"
    corresponding_files = f"{i}_files"
    with open(os.path.join(aoty_data_file_path, page), 'r', encoding='utf-8') as f:
        html_content = f.read()

    soup = BeautifulSoup(html_content, 'html.parser')

    albums = soup.select(".albumBlock")

    for album in albums:
        artist_name = album.select_one(".artistTitle").text.strip()
        album_name = album.select_one(".albumTitle").text.strip()
        try:
            image = album.select_one(".lazyloaded").get("data-src").split("/")[-1]
            image_path = os.path.join(corresponding_files, image)
        except AttributeError:
            try:
                image = album.select_one(".lazyload").get("data-src").split("/")[-1]
                image_path = os.path.join(corresponding_files, image)
            except AttributeError:
                # No cover image found: store None rather than reusing the previous album's path.
                image = None
                image_path = None
        try:
            album_rating = album.select_one(".ratingBlock").text.strip()
        except AttributeError:
            # Albums without a rating block get a placeholder score of "0".
            album_rating = "0"
        print(f"Artist: {artist_name}, Album: {album_name}, Rating: {album_rating}")
        aoty_website_data["artist"].append(artist_name)
        aoty_website_data["album"].append(album_name)
        aoty_website_data["album_image"].append(image_path)
        aoty_website_data["rating"].append(album_rating)

aoty_website_data = pd.DataFrame(aoty_website_data)
aoty_website_data.to_csv(os.path.join(processed_data_dir, "aoty_website_data.csv"), index=False)
aoty_website_data.head(50)

Artist: Joey Bada$$, Album: Lonely At The Top, Rating: 60
Artist: Sabrina Carpenter, Album: Man's Best Friend, Rating: 50
Artist: Nourished By Time, Album: The Passionate Ones, Rating: 80
Artist: Mac DeMarco, Album: Guitar, Rating: 50
Artist: Laufey, Album: A Matter of Time, Rating: 50
Artist: Deftones, Album: private music, Rating: 70
Artist: Earl Sweatshirt, Album: Live Laugh Love, Rating: 80
Artist: Chance the Rapper, Album: STAR LINE, Rating: 60
Artist: Joey Valence & Brae, Album: HYPERYOUTH, Rating: 70
Artist: Dijon, Album: Baby, Rating: 70
Artist: Anamanaguchi, Album: Anyway, Rating: 70
Artist: The Black Keys, Album: No Rain, No Flowers, Rating: 70
Artist: Ethel Cain, Album: Willoughby Tucker, I'll Always Love You, Rating: 70
Artist: Ninajirachi, Album: I Love My Computer, Rating: 80
Artist: JID, Album: God Does Like Ugly, Rating: 80
Artist: Yeat, Album: DANGEROUS SUMMER, Rating: 70
Artist: Metro Boomin, Album: Metro Boomin Presents: A Futuristic Summa (Hosted by DJ Spinz), Ratin

|   | artist | album | album_image | rating |
|---|--------|-------|-------------|--------|
| 0 | Joey Bada$$ | Lonely At The Top | 1_files\1395334-lonely-at-the-top_212732.jpg | 60 |
| 1 | Sabrina Carpenter | Man's Best Friend | 1_files\1350079-mans-best-friend_160607.jpg | 50 |
| 2 | Nourished By Time | The Passionate Ones | 1_files\1147985-the-passionate-ones_140850.jpg | 80 |
| 3 | Mac DeMarco | Guitar | 1_files\1342445-untitled_130348.jpg | 50 |
| 4 | Laufey | A Matter of Time | 1_files\1317471-a-matter-of-time_230439.jpg | 50 |
| 5 | Deftones | private music | 1_files\1383369-private-music_221559.jpg | 70 |
| 6 | Earl Sweatshirt | Live Laugh Love | 1_files\1431926-live-laugh-love_163817.jpg | 80 |
| 7 | Chance the Rapper | STAR LINE | 1_files\515977-star-line_174737.jpg | 60 |
| 8 | Joey Valence & Brae | HYPERYOUTH | 1_files\1374368-hyperyouth_182825.jpg | 70 |
| 9 | Dijon | Baby | 1_files\1406473-baby_194842.jpg | 70 |


In [173]:
aoty_website_data.iloc[0]["album_image"]

'1_files\\1350079-mans-best-friend_160607.jpg'

In [135]:
aoty_website_data.shape

(3214, 3)

Incredible coverage! Between our three sources here, we have thousands of albums to use for our prediction of future Anthony Fantano review scores.
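Combining the sources needs a shared key, and the titles don't line up as scraped: the Genius album names still carry their double quotes, and capitalisation may differ between sites. The sketch below shows one way to build such a key and attach the detailed Genius label to AOTY rows. The helper name `album_key`, the `detailed_rating` field, and the sample rows are all made up for illustration.

```python
def album_key(artist, album):
    """Build a case- and quote-insensitive key for matching albums across sources."""
    def clean(text):
        # Drop the surrounding double quotes the Genius titles carry,
        # then collapse whitespace and ignore case.
        return " ".join(text.strip().strip('"').casefold().split())
    return (clean(artist), clean(album))


# Made-up placeholder rows, shaped like the Genius and AOTY frames above.
genius_rows = [
    {"artist": "Artist A", "album": '"Album One"', "rating": "STRONG 7"},
]
aoty_rows = [
    {"artist": "artist a", "album": "Album One", "rating": "70"},
    {"artist": "Artist B", "album": "Album Two", "rating": "50"},
]

genius_by_key = {album_key(r["artist"], r["album"]): r["rating"] for r in genius_rows}

# Left join: keep every AOTY row, attach the Genius label when one exists.
combined = [
    {**row, "detailed_rating": genius_by_key.get(album_key(row["artist"], row["album"]))}
    for row in aoty_rows
]
for row in combined:
    print(row)
```

With the DataFrames from this notebook, the same key could be stored as a column and passed to `DataFrame.merge(..., how="left")` instead of the dictionary lookup.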

If time permits, we could potentially scrape all of the transcripts from the YouTube channel and see if we get some more detailed ratings. As it stands, we have at least 3214 clean records and could arrive at more if we make additional use of the "NOT GOOD" ratings.