# Lab 04: Scraping Reviews

**GOALS**: 

- Scrape album reviews from Pitchfork
- Scrape album images from Pitchfork


## LEVEL I

In the last example [*intro to webscraping*](08-Beautiful-Soup-Scraping.ipynb), we extracted basic information from the page listing all reviews on **pitchfork.com**.  Your first task now is to scrape the link to each review page.  This is akin to clicking on a review and being taken to the page with the full text.

![](images/pitch_ind.png)

At each page, your goal is to scrape the headline, the text of the review, the score as a number, the author, genre, and date.  If you're feeling ambitious, grab the sample music files when they exist.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

In [2]:
url = 'https://pitchfork.com/reviews/albums'

In [3]:
siteUrl = 'https://pitchfork.com'

In [4]:
response = requests.get(url)

In [5]:
response

<Response [200]>

In [6]:
response.text[:1000]

'<!DOCTYPE html><html lang="en"><head><title data-react-helmet="true">New Albums &amp; Music Reviews | Pitchfork</title><meta data-react-helmet="true" name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=no"/><meta data-react-helmet="true" name="og:type" content="website"/><meta data-react-helmet="true" name="og:site_name" content="Pitchfork"/><meta data-react-helmet="true" name="og:title" content="Pitchfork"/><meta data-react-helmet="true" name="og:url" content="https://pitchfork.com"/><meta data-react-helmet="true" name="description" content="Daily reviews of every important album in music"/><meta data-react-helmet="true" name="og:description" content="Daily reviews of every important album in music"/><script async="" src="/fonts-css/load-fonts.min.js"></script><link data-react-helmet="true" rel="shortcut icon" type="image/png" href="https://cdn.pitchfork.com/assets/misc/favicon-32.png"/><link data-react-helmet="true" rel="icon" type="image/png" href="https:/

In [7]:
soup = BeautifulSoup(response.text, 'html.parser')

In [8]:
soup.find('div', {'class': 'review'})

<div class="review"><a class="review__link" href="/reviews/albums/fucked-up-dose-your-dreams/"><div class="review__artwork artwork"><div class=""><img alt="" src="https://media.pitchfork.com/photos/5bad03d666ff630650f8e77d/1:1/w_160/fuckedup.jpg"/></div></div><div class="review__title"><ul class="artist-list review__title-artist"><li>Fucked Up</li></ul><h2 class="review__title-album">Dose Your Dreams</h2></div></a><div class="review__meta"><ul class="genre-list genre-list--inline review__genre-list"><li class="genre-list__item"><a class="genre-list__link" href="/reviews/albums/?genre=metal">Metal</a></li></ul><ul class="authors"><li><a class="linked display-name display-name--linked" href="/staff/ian-cohen/"><span class="by">by: </span>Ian Cohen</a></li></ul><time class="pub-date" datetime="2018-10-08T05:00:00" title="Mon, 08 Oct 2018 05:00:00 GMT">36 mins ago</time></div></div>

In [9]:
soup.find('div', {'class': 'review'}).find('a')

<a class="review__link" href="/reviews/albums/fucked-up-dose-your-dreams/"><div class="review__artwork artwork"><div class=""><img alt="" src="https://media.pitchfork.com/photos/5bad03d666ff630650f8e77d/1:1/w_160/fuckedup.jpg"/></div></div><div class="review__title"><ul class="artist-list review__title-artist"><li>Fucked Up</li></ul><h2 class="review__title-album">Dose Your Dreams</h2></div></a>

In [10]:
a = soup.find('div', {'class': 'review'}).find('a', href=True)
print("found URL:", a['href'])

found URL: /reviews/albums/fucked-up-dose-your-dreams/


In [11]:
reviews = soup.find_all('div', {'class': 'review'})

In [12]:
artists = []
albums = []
for review in reviews:
    t = review.find('li').text
    artists.append(t)
    s = review.find('h2').text
    albums.append(s)

--

__list of links to review pages__

In [13]:
links = []
for review in reviews:
    a = review.find('a', href = True)
    l = siteUrl + a['href']
    links.append(l)

In [14]:
links #all links to review pages

['https://pitchfork.com/reviews/albums/fucked-up-dose-your-dreams/',
 'https://pitchfork.com/reviews/albums/the-joy-formidable-aaarth/',
 'https://pitchfork.com/reviews/albums/maxwell-embrya/',
 'https://pitchfork.com/reviews/albums/madeline-kenney-perfect-shapes/',
 'https://pitchfork.com/reviews/albums/j-dilla-welcome-2-detroit/',
 'https://pitchfork.com/reviews/albums/stereolab-switched-on-refried-ectoplasm-aluminum-tunes/',
 'https://pitchfork.com/reviews/albums/foodman-aru-otoko-no-densetsu/',
 'https://pitchfork.com/reviews/albums/author-and-publisher-beastland/',
 'https://pitchfork.com/reviews/albums/ana-da-silva-phew-island/',
 'https://pitchfork.com/reviews/albums/cat-power-wanderer/',
 'https://pitchfork.com/reviews/albums/tom-petty-an-american-treasure/',
 'https://pitchfork.com/reviews/albums/marissa-nadler-for-my-crimes/']

In [15]:
response = requests.get(links[0])

In [16]:
response

<Response [200]>

In [17]:
response.text[:1000]

'<!DOCTYPE html><html lang="en"><head><title data-react-helmet="true">Fucked Up: Dose Your Dreams Album Review | Pitchfork</title><meta data-react-helmet="true" name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=no"/><meta data-react-helmet="true" name="og:site_name" content="Pitchfork"/><meta data-react-helmet="true" name="description" content="The art-hardcore band’s fifth album is a dynamic departure for the group, a long, psychedelic, concept-heavy odyssey that dips into many genres along the way."/><meta data-react-helmet="true" name="og:url" content="https://pitchfork.com/reviews/albums/fucked-up-dose-your-dreams/"/><meta data-react-helmet="true" name="og:title" content="Fucked Up: Dose Your Dreams"/><meta data-react-helmet="true" name="og:description" content="The art-hardcore band’s fifth album is a dynamic departure for the group, a long, psychedelic, concept-heavy odyssey that dips into many genres along the way."/><meta data-react-helmet="true" nam

In [18]:
#scrape the headline, the text of the review, the score as a number, the author, genre, and date

In [19]:
soup1 = BeautifulSoup(response.text, 'html.parser')

In [20]:
soup1.find('div', {'class': 'review-detail__abstract'}).find('p').text #headline

'The art-hardcore band’s fifth album is a dynamic departure for the group, a long, psychedelic, concept-heavy odyssey that dips into many genres along the way.'

In [21]:
soup1.find('div', {'class': 'contents'}).find('p').text

'Damian “Pink Eyes” Abraham has made a career on a being a bit much: The Canadian punk recently produced an extreme wrestling documentary called Bloodlust and looks like he could get in the ring himself, particularly when the burly, bearded, and frequently shirtless frontman of Fucked Up smashes bottles over his head on stage. He always sings like he’s trying to exfoliate his larynx with loose pieces of his ribcage and they’re the most abrasive vocals anyone will encounter from a band putting out records on Merge. Glass Boys, from 2014, represented Abraham’s purist vision of Fucked Up, a punk rock teleology that traced DIY ethics back to the ancient Greeks and had more guitar overdubs than a Smashing Pumpkins album.'

In [22]:
content = soup1.find('div', {'class': 'contents'}).find_all('p')

In [23]:
content

[<p>Damian “Pink Eyes” Abraham has made a career on a being a bit much: The Canadian punk recently produced an extreme wrestling documentary called <em>Bloodlust</em> and looks like he could get in the ring himself, particularly when the burly, bearded, and frequently shirtless frontman of Fucked Up <a href="https://www.youtube.com/watch?v=Mo2ozoeoo_k">smashes bottles over his head on stage</a>. He always sings like he’s trying to exfoliate his larynx with loose pieces of his ribcage and they’re the most abrasive vocals anyone will encounter from a band putting out records on Merge. <a href="https://pitchfork.com/reviews/albums/19400-fucked-up-glass-boys/"><em>Glass Boys</em></a>, from 2014, represented Abraham’s purist vision of Fucked Up, a punk rock teleology that traced DIY ethics back to the ancient Greeks and had more guitar overdubs than a <a href="https://pitchfork.com/artists/3838-the-smashing-pumpkins/">Smashing Pumpkins</a> album.</p>,
 <p>Yet, compared to the band’s double-

In [24]:
reviewText = []
for paragraph in content:
    p = paragraph.text
    reviewText.append(p)

In [25]:
reviewText #review text

['Damian “Pink Eyes” Abraham has made a career on a being a bit much: The Canadian punk recently produced an extreme wrestling documentary called Bloodlust and looks like he could get in the ring himself, particularly when the burly, bearded, and frequently shirtless frontman of Fucked Up smashes bottles over his head on stage. He always sings like he’s trying to exfoliate his larynx with loose pieces of his ribcage and they’re the most abrasive vocals anyone will encounter from a band putting out records on Merge. Glass Boys, from 2014, represented Abraham’s purist vision of Fucked Up, a punk rock teleology that traced DIY ethics back to the ancient Greeks and had more guitar overdubs than a Smashing Pumpkins album.',
 'Yet, compared to the band’s double-album rock operas and wooly Zodiac EPs, Glass Boys was a model of hardcore austerity, and its mild reception felt like a referendum on guitarist Mike Haliechuk ceding his artistic control. The line on Fucked Up is that they’ve been ex

In [26]:
firstReview = ""
for text in reviewText:
    firstReview = firstReview + text

In [27]:
firstReview

'Damian “Pink Eyes” Abraham has made a career on a being a bit much: The Canadian punk recently produced an extreme wrestling documentary called Bloodlust and looks like he could get in the ring himself, particularly when the burly, bearded, and frequently shirtless frontman of Fucked Up smashes bottles over his head on stage. He always sings like he’s trying to exfoliate his larynx with loose pieces of his ribcage and they’re the most abrasive vocals anyone will encounter from a band putting out records on Merge. Glass Boys, from 2014, represented Abraham’s purist vision of Fucked Up, a punk rock teleology that traced DIY ethics back to the ancient Greeks and had more guitar overdubs than a Smashing Pumpkins album.Yet, compared to the band’s double-album rock operas and wooly Zodiac EPs, Glass Boys was a model of hardcore austerity, and its mild reception felt like a referendum on guitarist Mike Haliechuk ceding his artistic control. The line on Fucked Up is that they’ve been expandin

--

__finding the headlines of the reviews__

In [28]:
headlines = []
for link in links:
    # fetch each review page and pull the abstract paragraph (the headline)
    response = requests.get(link)
    contentSoup = BeautifulSoup(response.text, 'html.parser')
    headline = contentSoup.find('div', {'class': 'review-detail__abstract'}).find('p').text
    headlines.append(headline)

In [29]:
headlines

['The art-hardcore band’s fifth album is a dynamic departure for the group, a long, psychedelic, concept-heavy odyssey that dips into many genres along the way.',
 'On their fourth album, the Welsh rockers build their towering songs on wobblier foundations for the sheer thrill of trying to make them topple.',
 'The reissue of Maxwell’s second album from 1998 showcases the mercurial spirit that followed the R&B auteur down new, aqueous corridors.',
 'On her second album, the sophisticated art-rock singer steps toward the center of her own songs, thanks in part to production from Wye Oak’s Jenn Wasner.',
 'Each Sunday, Pitchfork takes an in-depth look at a significant album from the past, and any record not in our archives is eligible. Today, we revisit a piece of Detroit history that rippled through all of hip-hop.',
 'Singles and splits documented a band’s between-albums evolution during the 1990s. These compilations, remastered and reissued, reveal that process for one of the era’s mo

--

__list of text from reviews__

In [30]:
reviewTexts = []
for link in links:
    response = requests.get(link)
    contentSoup = BeautifulSoup(response.text, 'html.parser')
    content = contentSoup.find('div', {'class': 'contents'}).find_all('p')
    # join the paragraphs of one review into a single string
    reviewTexts.append("".join(paragraph.text for paragraph in content))

In [31]:
reviewTexts[0:2]

['Damian “Pink Eyes” Abraham has made a career on a being a bit much: The Canadian punk recently produced an extreme wrestling documentary called Bloodlust and looks like he could get in the ring himself, particularly when the burly, bearded, and frequently shirtless frontman of Fucked Up smashes bottles over his head on stage. He always sings like he’s trying to exfoliate his larynx with loose pieces of his ribcage and they’re the most abrasive vocals anyone will encounter from a band putting out records on Merge. Glass Boys, from 2014, represented Abraham’s purist vision of Fucked Up, a punk rock teleology that traced DIY ethics back to the ancient Greeks and had more guitar overdubs than a Smashing Pumpkins album.Yet, compared to the band’s double-album rock operas and wooly Zodiac EPs, Glass Boys was a model of hardcore austerity, and its mild reception felt like a referendum on guitarist Mike Haliechuk ceding his artistic control. The line on Fucked Up is that they’ve been expandi

In [32]:
soup1 = BeautifulSoup(response.text, 'html.parser')

In [33]:
soup1.find('div', {'class': 'score-circle'}).find("span").text

'7.2'

--

__list of scores as floats__

In [34]:
scores = []
for link in links:
    response = requests.get(link)
    contentSoup = BeautifulSoup(response.text, 'html.parser')
    score = contentSoup.find('div', {'class': 'score-circle'}).find("span").text
    scores.append(float(score))  # convert the score text to a number

In [35]:
scores

[7.3, 6.9, 8.3, 7.2, 8.5, 7.7, 7.5, 7.1, 7.3, 7.4, 8.3, 7.2]

In [36]:
soup1.find('ul', {'class': 'authors-detail'}).find('a').text

'Olivia Horn '

--

__list of review authors__

In [37]:
authors = []
for link in links:
    response = requests.get(link)
    contentSoup = BeautifulSoup(response.text, 'html.parser')
    author = contentSoup.find('ul', {'class': 'authors-detail'}).find('a').text
    authors.append(author)

In [38]:
authors

['Ian Cohen',
 'Stuart Berman',
 'Brad Nelson',
 'Allison Hussey',
 'Edwin “STATS” Houghton',
 'Philip Sherburne',
 'Andy Beta',
 'Brian Howe',
 'Sasha Geffen',
 'Jayson Greene',
 'Sam Sodomsky',
 'Olivia Horn ']

In [39]:
response = requests.get(links[5])

In [40]:
response

<Response [200]>

In [41]:
response.text[:1000]

'<!DOCTYPE html><html lang="en"><head><title data-react-helmet="true">Stereolab: Switched On / Refried Ectoplasm / Aluminum Tunes Album Review | Pitchfork</title><meta data-react-helmet="true" name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=no"/><meta data-react-helmet="true" name="og:site_name" content="Pitchfork"/><meta data-react-helmet="true" name="description" content="Singles and splits documented a band’s between-albums evolution during the 1990s. These compilations, remastered and reissued, document that process for one of the era’s most innovative groups."/><meta data-react-helmet="true" name="og:url" content="https://pitchfork.com/reviews/albums/stereolab-switched-on-refried-ectoplasm-aluminum-tunes/"/><meta data-react-helmet="true" name="og:title" content="Stereolab: Switched On / Refried Ectoplasm / Aluminum Tunes"/><meta data-react-helmet="true" name="og:description" content="Singles and splits documented a band’s between-albums evolution duri

In [42]:
soup1 = BeautifulSoup(response.text, 'html.parser')

In [43]:
soup1.find('ul', {'class': 'genre-list'})

<ul class="genre-list genre-list--before"><li class="genre-list__item"><a class="genre-list__link" href="/reviews/albums/?genre=experimental">Experimental</a></li></ul>

In [44]:
soup1.find('ul', {'class': 'genre-list'}).find('li').text #if no genre list, then error

'Experimental'

--

**list of genres**

In [55]:
genres = []
for link in links:
    response = requests.get(link)
    contentSoup = BeautifulSoup(response.text, 'html.parser')
    genreList = contentSoup.find('ul', {'class': 'genre-list'})
    if genreList is not None:  # make sure there is a genre
        # join multiple genres with "/" so each review fills one list slot
        names = [item.text for item in genreList.find_all('li')]
        genres.append("/".join(names))
    else:
        genres.append("")

In [56]:
genres

['Metal',
 'Rock',
 'Pop/R&B',
 'Rock',
 'Rap',
 'Experimental',
 'Experimental',
 'Metal',
 'Experimental',
 'Rock',
 'Rock',
 'Folk/Country']

In [47]:
soup1.find('time', {'class': 'pub-date'}).text

'October 6 2018'

--

**list of dates**

In [48]:
pubDates = []
for link in links:
    response = requests.get(link)
    contentSoup = BeautifulSoup(response.text, 'html.parser')
    # use a name other than `time` so the standard-library module is not shadowed
    pubDate = contentSoup.find('time', {'class': 'pub-date'}).text
    pubDates.append(pubDate)

In [49]:
pubDates

['37 mins ago',
 '37 mins ago',
 '37 mins ago',
 '36 mins ago',
 'October 7 2018',
 'October 6 2018',
 'October 6 2018',
 'October 6 2018',
 'October 6 2018',
 'October 5 2018',
 'October 5 2018',
 'October 5 2018']
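The relative strings in this output ('37 mins ago') are awkward to sort or filter later. On the listing page, the `<time class="pub-date">` tag also carries a machine-readable `datetime` attribute, so one option is to read that attribute instead of the display text. The sketch below assumes the review pages expose the same attribute, which is not verified here; `parse_pub_date` is a hypothetical helper name.

```python
from datetime import datetime

from bs4 import BeautifulSoup


def parse_pub_date(html):
    """Return the publication date as a datetime, or None if it is not found."""
    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.find('time', {'class': 'pub-date'})
    # ASSUMPTION: the tag has a `datetime` attribute in ISO format,
    # as it does on the listing page scraped earlier.
    if tag is None or not tag.has_attr('datetime'):
        return None
    return datetime.fromisoformat(tag['datetime'])


# Markup copied from the listing-page output shown earlier in this lab.
sample = ('<time class="pub-date" datetime="2018-10-08T05:00:00" '
          'title="Mon, 08 Oct 2018 05:00:00 GMT">36 mins ago</time>')
print(parse_pub_date(sample))
```

Inside the loop, `parse_pub_date(response.text)` could replace the `.text` lookup if the attribute turns out to be present.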

In [50]:
soup1.find('div', {'class': 'player-display'}).find('span', {'class': 'source'}).find('a').text

'Bandcamp'

--

**list of music links**

In [53]:
musicLinks = []
for link in links:
    response = requests.get(link)
    contentSoup = BeautifulSoup(response.text, 'html.parser')
    player = contentSoup.find('div', {'class': 'player-display'})
    if player is not None:  # make sure there is a music sample
        sources = player.find_all('span', {'class': 'source'})
        hrefs = [s.find('a', href=True)['href'] for s in sources if s.find('a', href=True)]
        # keep one slot per review so the list stays aligned with links
        musicLinks.append(", ".join(hrefs))
    else:
        musicLinks.append("")

In [54]:
musicLinks

['',
 'https://thejoyformidableofficial.bandcamp.com/track/the-wrong-side',
 '',
 'https://madelinekenney.bandcamp.com/track/bad-idea',
 '',
 'https://stereolab.bandcamp.com/track/iron-man',
 'https://soundcloud.com/sun-araw/07-mizu-youkan',
 'https://authorandpunisher.bandcamp.com/track/nihil-strength',
 'https://anadasilva.bandcamp.com/track/the-fear-song',
 '',
 '',
 'https://marissanadler.bandcamp.com/track/for-my-crimes']
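Each loop above requests every review page again for a single field. A lighter approach is to fetch each page once and pull all fields from the same soup. The following is a minimal sketch of that idea, not the lab's reference solution: `find_text` and `parse_review` are hypothetical helper names, the class names are the ones used in the cells above, and the sample markup is invented for illustration. It also joins paragraphs with a space, where the cells above concatenate them directly.

```python
from bs4 import BeautifulSoup


def find_text(parent, name):
    """Text of the first <name> tag inside parent, or None if either is missing."""
    if parent is None:
        return None
    tag = parent.find(name)
    return tag.get_text(strip=True) if tag is not None else None


def parse_review(html):
    """Extract headline, text, score, author, genre and date from one review page."""
    soup = BeautifulSoup(html, 'html.parser')
    contents = soup.find('div', {'class': 'contents'})
    genre_list = soup.find('ul', {'class': 'genre-list'})
    date = soup.find('time', {'class': 'pub-date'})
    score = find_text(soup.find('div', {'class': 'score-circle'}), 'span')
    paragraphs = contents.find_all('p') if contents is not None else []
    genre_items = genre_list.find_all('li') if genre_list is not None else []
    return {
        'headline': find_text(soup.find('div', {'class': 'review-detail__abstract'}), 'p'),
        'text': ' '.join(p.get_text() for p in paragraphs),
        'score': float(score) if score else None,
        'author': find_text(soup.find('ul', {'class': 'authors-detail'}), 'a'),
        'genre': '/'.join(li.get_text(strip=True) for li in genre_items),
        'date': date.get_text(strip=True) if date is not None else None,
    }


# Invented sample markup that mimics the structure seen in the cells above.
sample_page = """
<div class="review-detail__abstract"><p>A dynamic departure.</p></div>
<div class="score-circle"><span>7.3</span></div>
<ul class="authors-detail"><li><a href="/staff/ian-cohen/">Ian Cohen</a></li></ul>
<ul class="genre-list"><li><a>Metal</a></li></ul>
<time class="pub-date">October 8 2018</time>
<div class="contents"><p>First paragraph.</p><p>Second paragraph.</p></div>
"""
print(parse_review(sample_page))
```

In the notebook this could be used as `rows = [parse_review(requests.get(link).text) for link in links]`, followed by `pd.DataFrame(rows)`, which makes one request per review instead of one per field.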

## LEVEL II

Go back to the original page of reviews and scroll down.  Notice that the URL at the top of the page simply adds a number as it advances.  This pattern will allow you to scrape multiple pages and gather reviews from earlier dates.

1. Build the URL for the next page of reviews by hand, and reuse your pattern above to scrape the additional reviews.
2. Write a loop to go through the next ten pages of reviews and gather each piece.
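The two steps above can be sketched as follows. The exact URL pattern is an assumption: the code guesses that the page number is passed as a `?page=N` query parameter, so check it against what your browser shows before relying on it. `page_url`, `review_links`, and `scrape_pages` are hypothetical helper names, and the one-second pause between requests is simply a courtesy to the server.

```python
import time

import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://pitchfork.com/reviews/albums/'
SITE_URL = 'https://pitchfork.com'


def page_url(page):
    """Build the URL of one listing page."""
    # ASSUMPTION: the page number is a "?page=N" query parameter.
    return f'{BASE_URL}?page={page}'


def review_links(html):
    """Collect absolute links to review pages from one listing page."""
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    for review in soup.find_all('div', {'class': 'review'}):
        a = review.find('a', href=True)
        if a is not None:
            links.append(SITE_URL + a['href'])
    return links


def scrape_pages(first, last, delay=1.0):
    """Gather review links from listing pages first..last (inclusive)."""
    all_links = []
    for page in range(first, last + 1):
        response = requests.get(page_url(page))
        if response.status_code != 200:  # stop when a page is unavailable
            break
        all_links.extend(review_links(response.text))
        time.sleep(delay)  # pause between requests
    return all_links
```

For the next ten pages, a call such as `more_links = scrape_pages(2, 11)` would return one list of review links that the Level I code can then visit.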

## LEVEL III



Write a loop to go through all available reviews.  Save the results as a `.csv` file.  If you were able to scrape the images, store them in a folder.
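One possible shape for the saving step is sketched below. It assumes the scraped reviews are held as a list of dictionaries (one per review) and that the image bytes have already been downloaded; `save_reviews` and `save_image` are hypothetical helper names, and the file naming rule (last segment of the image URL) is just one reasonable choice.

```python
import os

import pandas as pd


def save_reviews(rows, path):
    """Write a list of review dicts to a CSV file and return the DataFrame."""
    df = pd.DataFrame(rows)
    df.to_csv(path, index=False)  # index=False keeps the row numbers out of the file
    return df


def save_image(content, image_url, folder):
    """Store raw image bytes in `folder`, named after the last URL segment."""
    os.makedirs(folder, exist_ok=True)  # create the folder on first use
    filename = os.path.basename(image_url.split('?')[0])
    path = os.path.join(folder, filename)
    with open(path, 'wb') as f:
        f.write(content)
    return path
```

With an image `src` taken from a review's `<img>` tag, a call could look like `save_image(requests.get(src).content, src, 'images')`.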

## LEVEL IV

It is easy to use the `textblob` library to add the polarity and subjectivity of reviews to our `DataFrame`.  We need to convert the text to a `TextBlob` object, and then use its `.polarity` and `.subjectivity` attributes as new columns in our `DataFrame`.  Use the example below as a starting place to add two new columns to your dataframe containing the polarity and subjectivity scores for each review.

In [1]:
rev = "Danielle Bregoli’s leap from meme to rapper continues with her debut mixtape that leans heavily on mimicry and trails dreadfully behind the current sound of hip-hop."

In [2]:
from textblob import TextBlob

In [3]:
text = TextBlob(rev)

In [5]:
text.sentiment

Sentiment(polarity=-0.05000000000000002, subjectivity=0.5)

In [6]:
text.polarity

-0.05000000000000002

In [8]:
text.subjectivity

0.5