## Pitchfork Web Scraper
This notebook goes through the process of scraping a website, Pitchfork, for their new music recommendations.

In [1]:
#Imports
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [2]:
#Create a Request object by gathering information from the URL below
res = requests.get('https://pitchfork.com/best/')

In [3]:
#Check that the request worked (status code should be 200)
res.status_code

200

In [4]:
#Take a look at the text. Wow... this is a lot!
res.text[:1000]

'<!DOCTYPE html><html lang="en"><head><title data-react-helmet="true">Best New Music: Tracks, Albums &amp; Reissues | Pitchfork</title><meta data-react-helmet="true" name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=no"/><meta data-react-helmet="true" name="og:type" content="website"/><meta data-react-helmet="true" name="og:site_name" content="Pitchfork"/><meta data-react-helmet="true" name="og:title" content="Pitchfork"/><meta data-react-helmet="true" name="og:url" content="https://pitchfork.com"/><meta data-react-helmet="true" name="description" content="The most exciting and important music being released today"/><meta data-react-helmet="true" name="og:description" content="The most exciting and important music being released today"/><script async="" src="/fonts-css/load-fonts.min.js"></script><link data-react-helmet="true" rel="shortcut icon" type="image/png" href="https://cdn.pitchfork.com/assets/misc/favicon-32.png"/><link data-react-helmet="true" rel=

In [5]:
#Convert the raw text into a BS object. This is much easier to sort through!
best = BeautifulSoup(res.text, 'lxml')

#All the albums or tracks have the following class tages
class_tags = ['bnm-hero__review-block', 'bnm-small album-small', 'bnm-small track-small']

#This is a list of all the entries, looping through this should gather all of the information
reviews = best.find_all('div', {'class' : class_tags})

In [6]:
#Checking that the length of the lists matches the number of entires I counted on the site
len(reviews)

21

What to keep for each entry:
- Artist
- Title of work
- Preview of review
- Writer
- Rating
- Link to Pitchfork’s full review
- Distinguish if it is an album or a song
- Genre
- Artwork

In [7]:
#Here's a preview of what just one looks like
reviews[1]

<div class="bnm-hero__review-block"><a class="bnm-hero__link-block" href="/reviews/tracks/nazar-bunker-ft-shannen-sp/"><div class="artwork bnm-hero__artwork bnm-hero__artwork--with-notch"><img alt="" class="bnm-hero__img" src="https://media.pitchfork.com/photos/5f08b15526e46edd30202daf/1:1/w_320/bunker.jpg"/></div><div class="details"><h3 class="bnm-hero__artist"><ul class="artist-list"><li>Nazar</li></ul></h3><h3 class="bnm-hero__title">“Bunker” [ft. Shannen SP]</h3></div></a><ul class="authors"><li><a class="linked display-name display-name--linked" href="/staff/eric-torres/"><span class="by">by: </span>Eric Torres</a></li></ul><p class="bnm-hero__date"><time class="pub-date" datetime="2020-07-10T19:31:58" title="Fri, 10 Jul 2020 19:31:58 GMT">July 10 2020</time></p></div>

In [8]:
reviews[0].find('img').attrs['src']

'https://media.pitchfork.com/photos/5f284eb719fdf81aa4467942/1:1/w_320/Microphones%20In%202020_The%20Microphones.jpg'

All the information below can be found on the main page:

Because of the formatting Pitchfork uses, the code below will only work for the first three entries. The rest use different tags:

In [9]:
#Artist
print(reviews[0].find_all(['h3'])[0].text)

#Title of Album or Track
print(reviews[0].find_all(['h3'])[1].text)

#Link to Pitchfork's review
print(reviews[0].find('a').attrs['href'])

#Album or track? This can be found in the URL
print(reviews[0].find('a').attrs['href'].split('/')[2][:-1].title())

#Genre
print([genre.text for genre in reviews[0].find_all('li', {'class' : 'genre-list__item'})])

#Artwork
print(reviews[0].find('img').attrs['src'])

The Microphones
Microphones in 2020
/reviews/albums/the-microphones-microphones-in-2020/
Album
['Experimental', 'Rock']
https://media.pitchfork.com/photos/5f284eb719fdf81aa4467942/1:1/w_320/Microphones%20In%202020_The%20Microphones.jpg


In [10]:
[i.text for i in reviews[5].find_all(['ul'], {'class' : 'artist-list'})]

['Special Interest']

In [11]:
#Artist
print(', '.join([i.text for i in reviews[5].find_all(['ul'], {'class' : 'artist-list'})]))

#Title of Album or Track
print(reviews[5].find(['h2']).text)

#Link to Pitchfork's review
print(reviews[5].find('a').attrs['href'])

#Album or track? This can be found in the URL
print(reviews[5].find('a').attrs['href'].split('/')[2][:-1].title())

#Genre
print([genre.text for genre in reviews[5].find_all('li', {'class' : 'genre-list__item'})])

#Artwork
print(reviews[5].find('img').attrs['src'])

Special Interest
The Passion Of
/reviews/albums/special-interest-the-passion-of/
Album
['Rock']
https://media.pitchfork.com/photos/5efb778bdc55f46b46323ec7/1:1/w_160/The%20Passion%20Of_Special%20Interest.jpg


The other data points can be found on the individual review pages:

In [12]:
rev_res = requests.get('https://pitchfork.com' + reviews[0].find('a').attrs['href'])

In [13]:
rev_res.status_code

200

In [14]:
rev = BeautifulSoup(rev_res.text, 'lxml')

In [15]:
#Review Preview
print(rev.find('div', {'class': 'review-detail__abstract'}).find('p').text)

#Score
print(rev.find('span', {'class': 'score'}).text)

#Pitchfork Writer
print(rev.find('a', {'class': 'authors-detail__display-name'}).text)

Phil Elverum resurrects his beloved Microphones alias for a 45-minute song about art-making, self-mythologizing, and the endless search for meaning. 
8.5
Quinn Moreland


In [16]:
''.join(rev.find('div', {'class' : 'contents'}).find('p').text.split('.')[:2])

'Before he borrowed the name of the mountain that looms over his hometown of Anacortes, Washington, Phil Elverum wrote and performed songs as the Microphones, named in tribute to his recording equipment, which seemed to breathe and swell with a life of its own In the summer of 2019, 16 years after the project’s last proper release, Elverum exhumed this moniker for a show filled with old friends'

Putting all the pieces together:

In [17]:
def get_pitchfork(URL):
    #Create a Request object by gathering information from the URL below
    res = requests.get(URL + '/best/')
    
    #Check that the request worked (status code should be 200)
    if res.status_code == 200:
        #Convert the raw text into a BS object. This is much easier to sort through!
        best = BeautifulSoup(res.text, 'lxml')

        #All the albums or tracks have the following class tages
        class_tags = ['bnm-hero__review-block', 'bnm-small album-small', 'bnm-small track-small']

        #This is a list of all the entries, looping through this should gather all of the information
        reviews = best.find_all('div', {'class' : class_tags})
        
        
        #List to hold all of the dicationaries:
        all_dict = []
        
        #For loop for each review. 
        for i, review in enumerate(reviews):
            
            #If it's one of the first 3, follow first strucutre, otherwise follow the second
            if i <= 2:
                entry = {
                #Artist
                'artist' : review.find_all(['h3'])[0].text,

                #Title of Album or Track
                'title' : review.find_all(['h3'])[1].text,

                #Link to Pitchfork's review
                'review_link' : URL + review.find('a').attrs['href'],

                #Album or track? This can be found in the URL
                'album_track' : review.find('a').attrs['href'].split('/')[2][:-1].title(),

                #Genre
                'genre' : [genre.text for genre in review.find_all('li', {'class' : 'genre-list__item'})],

                #Artwork
                'artwork' : review.find('img').attrs['src'],
                }
            else:
                entry = {
                #Artist
                'artist' : ', '.join([i.text for i in review.find_all(['ul'], {'class' : 'artist-list'})]),

                #Title of Album or Track
                'title' : review.find(['h2']).text,

                #Link to Pitchfork's review
                'review_link' : review.find('a').attrs['href'],

                #Album or track? This can be found in the URL
                'album_track' : review.find('a').attrs['href'].split('/')[2][:-1].title(),

                #Genre
                'genre' : [genre.text for genre in review.find_all('li', {'class' : 'genre-list__item'})],

                #Artwork
                'artwork' : review.find('img').attrs['src'],
                }
            
            rev_res = requests.get(URL + review.find('a').attrs['href'])
            
            if rev_res.status_code == 200:
                rev = BeautifulSoup(rev_res.text, 'lxml')
                
                if entry['album_track'] == 'Album':
                    #Review Preview
                    entry['preview'] = rev.find('div', {'class': 'review-detail__abstract'}).find('p').text
                    
                    #Score
                    entry['score'] = rev.find('span', {'class': 'score'}).text
                else:
                    #For track reviews, just include the first two sentences as preview
                    entry['preview'] = ''.join(rev.find('div', {'class' : 'contents'}).find('p').text.split('.')[:2])
                    
                    #Score
                    entry['score'] = 'N/A'
                
                

                #Pitchfork Writer
                entry['author'] = rev.find('a', {'class': 'authors-detail__display-name'}).text
                    
            all_dict.append(entry)
            
    return pd.DataFrame(all_dict)
            

In [18]:
%%time
new_music = get_pitchfork('https://pitchfork.com/')

CPU times: user 1.26 s, sys: 71.8 ms, total: 1.33 s
Wall time: 5.05 s


In [19]:
new_music

Unnamed: 0,artist,title,review_link,album_track,genre,artwork,preview,score,author
0,The Microphones,Microphones in 2020,https://pitchfork.com//reviews/albums/the-micr...,Album,"[Experimental, Rock]",https://media.pitchfork.com/photos/5f284eb719f...,Phil Elverum resurrects his beloved Microphone...,8.5,Quinn Moreland
1,Nazar,“Bunker” [ft. Shannen SP],https://pitchfork.com//reviews/tracks/nazar-bu...,Track,[],https://media.pitchfork.com/photos/5f08b15526e...,For the Manchester-based Angolan electronic ar...,,Eric Torres
2,that dog.,Totally Crushed Out! / Retreat From the Sun,https://pitchfork.com//reviews/albums/that-dog...,Album,[Rock],https://media.pitchfork.com/photos/5f1efd858fd...,Reissues of the L.A. band’s mid-’90s albums ca...,8.3,Jenn Pelly
3,Dehd,Flower of Devotion,/reviews/albums/dehd-flower-of-devotion/,Album,[Rock],https://media.pitchfork.com/photos/5f0e1f47763...,"Recording in a studio, the Chicago DIY trio so...",8.3,Steven Arroyo
4,Julianna Barwick,Healing Is a Miracle,/reviews/albums/julianna-barwick-healing-is-a-...,Album,[Electronic],https://media.pitchfork.com/photos/5f073c85ca5...,The vocalist and producer Juliana Barwick’s re...,8.3,Will Gottsegen
5,Special Interest,The Passion Of,/reviews/albums/special-interest-the-passion-of/,Album,[Rock],https://media.pitchfork.com/photos/5efb778bdc5...,"Mixing art-punk, industrial, and techno, the o...",8.4,Jenn Pelly
6,Jessie Ware,What’s Your Pleasure?,/reviews/albums/jessie-ware-whats-your-pleasure/,Album,[Pop/R&B],https://media.pitchfork.com/photos/5ef619d0881...,"On her disco-inspired new album, Ware sounds b...",8.3,Owen Myers
7,Haim,Women in Music Pt. III,/reviews/albums/haim-women-in-music-pt-iii/,Album,[Rock],https://media.pitchfork.com/photos/5ef25198881...,The third album from the trio is far and away ...,8.6,Aimee Cliff
8,Phoebe Bridgers,Punisher,/reviews/albums/phoebe-bridgers-punisher/,Album,[Folk/Country],https://media.pitchfork.com/photos/5ee923f47bb...,"On her marvelous second album, Phoebe Bridgers...",8.7,Sam Sodomsky
9,Moor Motherbilly woods,“Furies”,/reviews/tracks/moor-mother-billy-woods-furies/,Track,"[Experimental, Rap]",https://media.pitchfork.com/photos/5f077ec385f...,"On “Furies,” Moor Mother and billy woods work ...",,Sheldon Pearce


In [20]:
new_music.to_csv('./new_music.csv')