# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to practice those skills on a full-fledged site!
In this lab, you'll scrape event listings from a music website: https://www.residentadvisor.net.

## Objectives

You will be able to:
* Create a full scraping pipeline that traverses many pages of a website, handles errors, and stores the data

## View the Website

For this lab, you'll be scraping the https://www.residentadvisor.net website. Start by navigating to the events page [here](https://www.residentadvisor.net/events) in your browser.

<img src="images/ra.png">

In [None]:
# Load the https://www.residentadvisor.net/events page in your browser.

## Open the Inspect Element Feature

Next, open your browser's Inspect Element feature to preview the underlying HTML of the page.

In [None]:
# Open the inspect element feature in your browser

In [1]:
import re
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time

In [2]:
# My code
events_page = requests.get('https://www.residentadvisor.net/events/us/massachusetts') # Make a get request to retrieve the page
soup = BeautifulSoup(events_page.content, 'html.parser')

In [None]:
# Their code
response = requests.get("https://www.residentadvisor.net/events/us/newyork")
soup = BeautifulSoup(response.content, 'html.parser')

In [3]:
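# My code
# Note: prettify is a method. Without parentheses, this cell displays the bound method's repr
# (which embeds the raw document); call soup.prettify() to get the indented HTML string.
soup.prettify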

<bound method Tag.prettify of <!DOCTYPE html>

<html lang="en,ja,es">
<head id="_x1"><title>
	RA: Events in Massachusetts, United States of America
</title><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><meta content="en,ja,es" http-equiv="content-language"/><meta content="RA: Resident Advisor" name="Description"/><meta content="RA, residentadvisor, resident, advisor, music, ra, events, in, massachusetts, united, states, america" name="Keywords"/><meta content="Resident Advisor" name="Author"/><meta content="Resident Advisor" property="og:site_name"/><meta content="712773712080127" property="fb:app_id"/><link href="/bundles/default-css?v=ATv7yC5anBBrxJoYdSr-DqUPyab_mqaaXHG0qxMzlYI1" rel="stylesheet"/>
<meta content="app-id=981952703, app-argument=ra-guide://search" name="apple-itunes-app"/><link href="/bundles/cat-listings-css?v=qgpSmyPbylOKeJFqy2yvCrTgAsw9yQYcJtLKS_vPO6s1" rel="stylesheet"/>
<link href="/favicon.ico" rel="icon" type="image/vnd.microsoft.icon"/><li

In [4]:
event_list = soup.find('div', class_="strip slide small")
event_list

<div class="strip slide small" data-type="events" id="events-listing">
<ul class="list small clearfix popular" style="padding: 0;">
<li class="">
<article class="highlight-top">
<p>Thu, 7 Nov 2019</p>
<a ga-event-action="popular-events" ga-event-category="events-page" ga-on="click" href="/events/1304938"><img class="nohide" src="/images/events/flyer/2019/11/us-1107-1304938-list.jpg"/></a>
<p class="counter nohide">
<span>12</span> attending
</p>
<a ga-event-action="popular-events" ga-event-category="events-page" ga-on="click" href="/events/1304938">
<h1>
Claude Vonstroke
</h1>
</a>
<p class="copy nohide">
<a href="\club.aspx?id=146083">The Grand Boston</a>
</p>
</article>
</li><li class="">
<article class="highlight-top">
<p>Fri, 8 Nov 2019</p>
<a ga-event-action="popular-events" ga-event-category="events-page" ga-on="click" href="/events/1331002"><img class="nohide" src="/images/events/flyer/2019/11/us-1108-1331002-list.jpg"/></a>
<p class="counter nohide">
<span>44</span> attending
<

In [5]:
next_container = event_list.nextSibling.nextSibling 
next_container

<div class="link-more mobile-only tablet-off">
<a></a>
</div>

In [26]:
titles = event_list.findAll('li') # Make a selection
titles[1]

<li class="">
<article class="highlight-top">
<p>Fri, 8 Nov 2019</p>
<a ga-event-action="popular-events" ga-event-category="events-page" ga-on="click" href="/events/1331002"><img class="nohide" src="/images/events/flyer/2019/11/us-1108-1331002-list.jpg"/></a>
<p class="counter nohide">
<span>44</span> attending
</p>
<a ga-event-action="popular-events" ga-event-category="events-page" ga-on="click" href="/events/1331002">
<h1>
Visceral X Mess
</h1>
</a>
<p class="copy nohide">
<a href="\club.aspx?id=131409">The Lower Level</a>
</p>
</article>
</li>

In [14]:
titles[0].find('h1')

<h1>
Claude Vonstroke
</h1>

In [17]:
titles[0].find('img')

<img class="nohide" src="/images/events/flyer/2019/11/us-1107-1304938-list.jpg"/>

In [32]:
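# Caution: event_list.find('h1') ignores the loop variable and returns the first <h1> every time,
# so all 8 entries are the same title. Using h1 itself would yield each event's title.
final_titles = [event_list.find('h1') for h1 in event_list.findAll('h1')]
print(len(final_titles), final_titles[:])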

8 [<h1>
Claude Vonstroke
</h1>, <h1>
Claude Vonstroke
</h1>, <h1>
Claude Vonstroke
</h1>, <h1>
Claude Vonstroke
</h1>, <h1>
Claude Vonstroke
</h1>, <h1>
Claude Vonstroke
</h1>, <h1>
Claude Vonstroke
</h1>, <h1>
Claude Vonstroke
</h1>]
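
The repeated title in the output above comes from calling `find('h1')` on the container inside the comprehension instead of using the loop variable. Here is a minimal sketch of the intended pattern. It runs against a small hand-written HTML snippet, which is an assumption standing in for the live page, and the names `sample_html`, `sample_list` and `sample_titles` are illustrative only.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the events strip; not the live page.
sample_html = """
<ul>
  <li><h1>
Claude Vonstroke
</h1></li>
  <li><h1>
Visceral X Mess
</h1></li>
</ul>
"""

sample_list = BeautifulSoup(sample_html, "html.parser")

# Use the loop variable itself; calling sample_list.find('h1') here
# would return the first match on every iteration.
sample_titles = [h1.get_text(strip=True) for h1 in sample_list.find_all("h1")]
print(sample_titles)
```

`get_text(strip=True)` also removes the newlines that surround each title in the raw markup.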


In [33]:
new_event_list = soup.find('ul', class_="list small clearfix popular")
new_event_list

<ul class="list small clearfix popular" style="padding: 0;">
<li class="">
<article class="highlight-top">
<p>Thu, 7 Nov 2019</p>
<a ga-event-action="popular-events" ga-event-category="events-page" ga-on="click" href="/events/1304938"><img class="nohide" src="/images/events/flyer/2019/11/us-1107-1304938-list.jpg"/></a>
<p class="counter nohide">
<span>12</span> attending
</p>
<a ga-event-action="popular-events" ga-event-category="events-page" ga-on="click" href="/events/1304938">
<h1>
Claude Vonstroke
</h1>
</a>
<p class="copy nohide">
<a href="\club.aspx?id=146083">The Grand Boston</a>
</p>
</article>
</li><li class="">
<article class="highlight-top">
<p>Fri, 8 Nov 2019</p>
<a ga-event-action="popular-events" ga-event-category="events-page" ga-on="click" href="/events/1331002"><img class="nohide" src="/images/events/flyer/2019/11/us-1108-1331002-list.jpg"/></a>
<p class="counter nohide">
<span>44</span> attending
</p>
<a ga-event-action="popular-events" ga-event-category="events-page"

In [36]:
new_titles = new_event_list.findAll('h1') # Make a selection
new_titles[2]

<h1>
Sure Thing: Feral (Hypnus)
</h1>

In [40]:
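# new_titles is already a ResultSet (a list of <h1> tags), so it has no find/findAll methods;
# this raises the AttributeError below. Iterating over new_titles directly is enough.
new_final_titles = [new_titles.find('h1') for h1 in new_titles.findAll('h1')]
print(len(new_final_titles), new_final_titles[:])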

AttributeError: ResultSet object has no attribute 'findAll'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

In [42]:
# Same pitfall as before: find('h1') on the container returns the first <h1> on each pass.
all_final_titles = [new_event_list.find('h1') for h1 in new_event_list.findAll('h1')]
print(len(all_final_titles), all_final_titles[:5])

8 [<h1>
Claude Vonstroke
</h1>, <h1>
Claude Vonstroke
</h1>, <h1>
Claude Vonstroke
</h1>, <h1>
Claude Vonstroke
</h1>, <h1>
Claude Vonstroke
</h1>]


In [49]:
images = new_event_list.findAll('img')
ex_img = images[2] # Preview an entry
ex_img

<img class="nohide" src="/images/events/flyer/2019/11/us-1109-1336209-list.jpg"/>

In [51]:
# find('img') on the container returns the first <img> each time, hence 8 identical flyers.
final_images = [new_event_list.find('img') for img in new_event_list.findAll('img')]
print(len(final_images), final_images[:])

8 [<img class="nohide" src="/images/events/flyer/2019/11/us-1107-1304938-list.jpg"/>, <img class="nohide" src="/images/events/flyer/2019/11/us-1107-1304938-list.jpg"/>, <img class="nohide" src="/images/events/flyer/2019/11/us-1107-1304938-list.jpg"/>, <img class="nohide" src="/images/events/flyer/2019/11/us-1107-1304938-list.jpg"/>, <img class="nohide" src="/images/events/flyer/2019/11/us-1107-1304938-list.jpg"/>, <img class="nohide" src="/images/events/flyer/2019/11/us-1107-1304938-list.jpg"/>, <img class="nohide" src="/images/events/flyer/2019/11/us-1107-1304938-list.jpg"/>, <img class="nohide" src="/images/events/flyer/2019/11/us-1107-1304938-list.jpg"/>]


## Write a Function to Scrape All of the Events on the Given Events Page

The function should return a pandas DataFrame with the columns Event_Name, Venue, Event_Date and Number_of_Attendees.

In [None]:
def scrape_events(events_page_url):
    #Your code here
    df.columns = ["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]
    return df

## Write a Function to Retrieve the URL for the Next Page

In [None]:
def next_page(url):
    #Your code here
    return next_page_url

## Scrape the Next 1000 Events for Your Area

Display the data sorted by the number of attendees. If there is a tie for the number attending, sort by event date.
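
The requested ordering can be sketched in plain Python so that the tie-breaking logic is visible. This is an illustration, not the lab's solution. It assumes rows shaped like (event name, venue, date string, attendee count), a date format such as `Thu, 07 Nov 2019 /`, and a sort direction of most-attended first with ties going to the earlier date, since the task does not state a direction. The rows and the helper `parse_date` are hypothetical.

```python
from datetime import datetime

# Hypothetical rows: (event name, venue, date string, attendees).
sample_rows = [
    ("Event A", "Venue 1", "Sat, 09 Nov 2019 /", 18),
    ("Event B", "Venue 2", "Thu, 07 Nov 2019 /", 44),
    ("Event C", "Venue 3", "Fri, 08 Nov 2019 /", 18),
]

def parse_date(text):
    """Turn 'Sat, 09 Nov 2019 /' into a datetime so dates sort chronologically."""
    return datetime.strptime(text.rstrip(" /"), "%a, %d %b %Y")

# Most-attended first; ties are broken by the earlier date.
ordered = sorted(sample_rows, key=lambda r: (-r[3], parse_date(r[2])))
for row in ordered:
    print(row)
```

With a DataFrame, the equivalent would be to convert `Event_Date` to datetimes and then call `df.sort_values(["Number_of_Attendees", "Event_Date"], ascending=[False, True])`. Sorting the raw date strings would order them alphabetically rather than chronologically.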

In [None]:
#Your code here

## Summary 

Congratulations! In this lab, you successfully developed a pipeline to scrape a website for concert event information!

In [52]:
event_listings = soup.find('div', id="event-listing")

In [53]:
entries = event_listings.findAll('li')
print(len(entries), entries[0])

13 <li><p class="eventDate date"><a href="/events.aspx?ai=79&amp;v=day&amp;mn=11&amp;yr=2019&amp;dy=7"><span>Thu, 07 Nov 2019 /</span></a></p></li>


In [54]:
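#Successive exploration in function development
rows = []
cur_date = None  # most recent date row seen; events listed after it share this date
for entry in entries:
    #Is it a date? If so, set current date.
    date = entry.find('p', class_="eventDate date")
    event = entry.find('h1', class_="event-title")
    if event:
        # Event headings read "<event name> at <venue>"
        details = event.text.split(' at ')
        event_name = details[0].strip()
        venue = details[1].strip()
        try:
            # The leading digits of the "attending" text give the attendee count
            n_attendees = int(re.match(r"(\d*)", entry.find('p', class_="attending").text)[0])
        except (AttributeError, ValueError):
            # No "attending" element, or no digits at the start of its text
            n_attendees = np.nan
        rows.append([event_name, venue, cur_date, n_attendees])
    elif date:
        cur_date = date.text
    else:
        continue
df = pd.DataFrame(rows)
df.head()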

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | Claude Vonstroke | The Grand Boston | Thu, 07 Nov 2019 / | 12 |
| 1 | Takeover with Matt Mcneill, Mrph, Ryan Perkins | Zuzu | Thu, 07 Nov 2019 / | 3 |
| 2 | Visceral X Mess | The Lower Level | Fri, 08 Nov 2019 / | 44 |
| 3 | Sure Thing: Feral (Hypnus) | The Lower Level | Sat, 09 Nov 2019 / | 18 |
| 4 | vyo͞o presents: Binh | TBA - Boston | Sat, 09 Nov 2019 / | 10 |
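
The loop above rests on two ideas: a date row applies to every event row that follows it until the next date row, and each event heading reads "name at venue". A minimal sketch of both, using plain tuples as stand-ins for the parsed `<li>` elements so that it runs without network access. The tuple layout, the "attending" strings and the `sample_` names are assumptions. It also swaps `split(' at ')` for `rpartition(' at ')`, which splits on the last occurrence, as one possible way to keep event names that themselves contain " at " intact.

```python
import re

# Stand-ins for parsed <li> entries: ("date", text) or ("event", title, attending text).
# The attending strings are assumed formats, not copied from the site.
sample_entries = [
    ("date", "Thu, 07 Nov 2019 /"),
    ("event", "Claude Vonstroke at The Grand Boston", "12 Attending"),
    ("date", "Fri, 08 Nov 2019 /"),
    ("event", "Visceral X Mess at The Lower Level", ""),
]

sample_rows = []
cur_date = None  # most recent date row seen so far
for entry in sample_entries:
    if entry[0] == "date":
        cur_date = entry[1]
        continue
    _, title, attending = entry
    # Split on the last ' at ' so the venue is whatever follows it.
    event_name, _, venue = title.rpartition(" at ")
    match = re.match(r"\d+", attending)
    n_attendees = int(match.group()) if match else None  # None when no count is shown
    sample_rows.append([event_name.strip(), venue.strip(), cur_date, n_attendees])

print(sample_rows)
```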


In [55]:
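#Final function
def scrape_events(events_page_url):
    """Scrape one events page into a DataFrame with one row per event."""
    columns = ["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]
    response = requests.get(events_page_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Look up the listing in this page's soup rather than reusing the notebook-level variable,
    # which would return the same events for every URL
    event_listings = soup.find('div', id="event-listing")
    if event_listings is None:
        return pd.DataFrame(columns=columns)
    entries = event_listings.findAll('li')
    rows = []
    cur_date = None  # most recent date row seen; events listed after it share this date
    for entry in entries:
        #Is it a date? If so, set current date.
        date = entry.find('p', class_="eventDate date")
        event = entry.find('h1', class_="event-title")
        if event:
            details = event.text.split(' at ')
            event_name = details[0].strip()
            venue = details[1].strip()
            try:
                n_attendees = int(re.match(r"(\d*)", entry.find('p', class_="attending").text)[0])
            except (AttributeError, ValueError):
                # No "attending" element, or no digits at the start of its text
                n_attendees = np.nan
            rows.append([event_name, venue, cur_date, n_attendees])
        elif date:
            cur_date = date.text
        else:
            continue
    return pd.DataFrame(rows, columns=columns)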

In [56]:
# Function development cell
soup.find('a', attrs={'ga-event-action':"Next "}).attrs['href']

'/events/us/massachusetts/week/2019-11-14'
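
The `href` returned above is site-relative, so it has to be combined with the site's address before it can be requested. One way to do that, sketched here with the standard library's `urljoin`, avoids hard-coding the domain. The helper name `resolve_href` is hypothetical, and returning `None` when no link was found is an assumed convention rather than something the lab prescribes.

```python
from urllib.parse import urljoin

def resolve_href(page_url, href):
    """Return an absolute URL for href, or None when no link was found."""
    if href is None:
        return None
    return urljoin(page_url, href)

page_url = "https://www.residentadvisor.net/events/us/massachusetts"
relative_href = "/events/us/massachusetts/week/2019-11-14"  # value shown above

print(resolve_href(page_url, relative_href))
print(resolve_href(page_url, None))
```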

In [57]:
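# Write a Function to Retrieve the URL for the Next Page
def next_page(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # The "Next " link (note the trailing space in the attribute value) holds a site-relative href.
    # soup.find returns None when the page has no such link, in which case .attrs raises AttributeError.
    url_ext = soup.find('a', attrs={'ga-event-action':"Next "}).attrs['href']
    next_page_url = "https://www.residentadvisor.net" + url_ext
    return next_page_url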
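
With a way to find the next page, collecting a fixed number of events becomes a loop that stops when the target is reached or when no further page exists. A minimal sketch of that control flow, using an in-memory dictionary in place of real requests so that it runs offline. `FAKE_PAGES`, `fetch_rows`, `fetch_next` and `collect` are all hypothetical stand-ins, and using `None` to signal "no next page" is an assumed convention.

```python
import time

# Hypothetical in-memory "site": url -> (rows on that page, next url or None).
FAKE_PAGES = {
    "page-1": (["a", "b", "c"], "page-2"),
    "page-2": (["d", "e"], "page-3"),
    "page-3": (["f"], None),
}

def fetch_rows(url):
    """Stand-in for scraping one page of events."""
    return FAKE_PAGES[url][0]

def fetch_next(url):
    """Stand-in for finding the next page; None means there is no further page."""
    return FAKE_PAGES[url][1]

def collect(start_url, target, delay=0.0):
    """Gather rows page by page until target rows are collected or pages run out."""
    collected = []
    url = start_url
    while url is not None and len(collected) < target:
        collected.extend(fetch_rows(url))
        url = fetch_next(url)
        time.sleep(delay)  # pause between requests to avoid hammering the server
    return collected[:target]

print(collect("page-1", target=4))
print(collect("page-1", target=100))
```

The second call asks for more rows than exist, so the loop ends when the last page reports no successor instead of raising an error.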

In [58]:
# Scrape the Next 1000 Events for Your Area
dfs = []
total_rows = 0
cur_url = "https://www.residentadvisor.net/events/us/newyork"
while total_rows < 1000 and cur_url is not None:
    df = scrape_events(cur_url)
    dfs.append(df)
    total_rows += len(df)
    try:
        cur_url = next_page(cur_url)
    except AttributeError:
        # No "Next " link was found on this page, so stop paginating
        cur_url = None
    time.sleep(.2)
df = pd.concat(dfs)
df = df.iloc[:1000]
print(len(df))
df.head()