# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to practice those skills on a full-fledged site!
In this lab, you'll scrape event listings from a music website: https://www.residentadvisor.net.

## Objectives

You will be able to:
* Create a full scraping pipeline that traverses many pages of a website, handles errors, and stores the data

## View the Website

Start by opening the [events page](https://www.residentadvisor.net/events) of https://www.residentadvisor.net in your browser.

<img src="images/ra.png">

## Open the Inspect Element Feature

Next, open your browser's Inspect Element tool to preview the HTML underlying the page.

## Write a Function to Scrape All of the Events on a Given Events Page

The function should return a pandas DataFrame with the columns `Event_Name`, `Venue`, `Event_Date`, and `Number_of_Attendees`.

In [1]:
import pandas as pd
from bs4 import BeautifulSoup as bs
import requests
import re
import numpy as np

In [2]:
html_page = requests.get('https://www.residentadvisor.net/events')
soup = bs(html_page.content, 'html.parser')

In [3]:
url = 'https://www.residentadvisor.net/events'

In [4]:
def scrape_events(events_page_url):
    """Scrape one events page into a DataFrame."""
    soup = bs(requests.get(events_page_url).content, 'html.parser')
    event_container = soup.find('div', class_="content clearfix")
    titles = event_container.find_all('h1', class_='event-title')
    # Trim the last 6 characters of each <time> text, leaving the date
    dates = [date.text.strip()[:-6] for date in event_container.find_all('time')]
    names = [title.find('a').text.strip() for title in titles]
    # Skip the 3-character prefix that precedes the venue name
    venues = [title.find('span').text.strip()[3:] for title in titles]
    attendees = [attendee.find('span').text.strip()
                 for attendee in event_container.find_all('p', class_='attending')]
    # Caution: each list is collected independently, so an event with no
    # attendee count leaves `attendees` shorter and misaligns later rows.
    df = pd.DataFrame([names, venues, dates, attendees]).transpose()
    df.columns = ["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]
    return df

In [5]:
scrape_events(url)

| | Event_Name | Venue | Event_Date | Number_of_Attendees |
|---|---|---|---|---|
| 0 | Clinic with Sebastian Mullaert (Circle of Live) | The Sayers Club, Los Angeles | 2020-04-08 | 7.0 |
| 1 | CCVO Thursdays with Kerim Bey & Joe Dismal | Underground SF, San Francisco | 2020-04-09 | 1.0 |
| 2 | Postponed - Damian Lazarus (Crosstown Rebels) | Public Works, San Francisco | 2020-04-09 | 22.0 |
| 3 | [CANCELLED] DJ Koze & Floating Points | 1015 Folsom, San Francisco | 2020-04-10 | 37.0 |
| 4 | Cancelled | F8 1192 Folsom, San Francisco | 2020-04-10 | 22.0 |
| 5 | Desert Dream feat. Walker & Royce, Will Clarke... | Equl Estate | 2020-04-10 | 2.0 |
| 6 | Canceled - Om Unit, The Librarian, J:Kenzo by ... | Public Works, San Francisco | 2020-04-10 | 2.0 |
| 7 | [POSTPONED] Coachella 2020 | Empire Polo Club | 2020-04-10 | 23.0 |
| 8 | DUBLAB x Doom Trip present: The Doom Mix IV Re... | TBA - Downtown LA, Los Angeles | 2020-04-10 | 2.0 |
| 9 | Rich Medina presents Home | Resident, Los Angeles | 2020-04-10 | 1.0 |
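
The function above gathers each column with a separate page-wide search, which only lines up if every event has every field. The sketch below uses made-up values (the event names and counts are hypothetical, not scraped) to show what can happen when one event has no attendee count, and how pairing the values per event avoids it.

```python
import pandas as pd

# Hypothetical scraped values: the second event has no attendee count,
# so the attendees list is one item shorter than the names list.
names = ["Event A", "Event B", "Event C"]
attendees = ["7", "22"]  # "22" actually belongs to Event C

# Same construction as the function above: stack the lists, then transpose.
misaligned = pd.DataFrame([names, attendees]).transpose()
misaligned.columns = ["Event_Name", "Number_of_Attendees"]
print(misaligned)  # Event B is shown with Event C's count

# Pairing the values per event keeps every row consistent.
records = [
    {"Event_Name": "Event A", "Number_of_Attendees": "7"},
    {"Event_Name": "Event B", "Number_of_Attendees": 0},  # default when missing
    {"Event_Name": "Event C", "Number_of_Attendees": "22"},
]
aligned = pd.DataFrame(records)
print(aligned)
```

Collecting the fields one event at a time, with a default for anything missing, is one way to keep the rows consistent.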


In [6]:
def scrape_event2(events_page_url):
    """Scrape one events page, parsing each event's <article> separately."""
    soup = bs(requests.get(events_page_url).content, 'html.parser')
    articles = soup.find('ul', id="items").find_all('article')
    dates, names, venues, attendees = [], [], [], []
    for article in articles:
        title = article.find('h1', class_='event-title')
        # Trim the last 6 characters of the <time> text, leaving the date
        dates.append(article.find('time').text.strip()[:-6])
        names.append(title.find('a').text.strip())
        # Skip the 3-character prefix that precedes the venue name
        venues.append(title.find('span').text.strip()[3:])
        try:
            attendee = article.find('p', class_='attending').find('span').text
        except AttributeError:
            # find() returned None: this event has no attendee count
            attendee = 0
        attendees.append(attendee)
    df = pd.DataFrame([names, venues, dates, attendees]).transpose()
    df.columns = ['Event_Name', 'Venue', 'Event_Date', 'Number_of_Attendees']
    return df

In [7]:
scrape_event2(url)

| | Event_Name | Venue | Event_Date | Number_of_Attendees |
|---|---|---|---|---|
| 0 | Clinic with Sebastian Mullaert (Circle of Live) | The Sayers Club, Los Angeles | 2020-04-08 | 7 |
| 1 | CCVO Thursdays with Kerim Bey & Joe Dismal | Underground SF, San Francisco | 2020-04-09 | 1 |
| 2 | Postponed - Damian Lazarus (Crosstown Rebels) | Public Works, San Francisco | 2020-04-09 | 22 |
| 3 | [CANCELLED] DJ Koze & Floating Points | 1015 Folsom, San Francisco | 2020-04-10 | 37 |
| 4 | Cancelled | F8 1192 Folsom, San Francisco | 2020-04-10 | 22 |
| 5 | Desert Dream feat. Walker & Royce, Will Clarke... | Equl Estate | 2020-04-10 | 2 |
| 6 | Canceled - Om Unit, The Librarian, J:Kenzo by ... | Public Works, San Francisco | 2020-04-10 | 2 |
| 7 | [POSTPONED] Coachella 2020 | Empire Polo Club | 2020-04-10 | 23 |
| 8 | DUBLAB x Doom Trip present: The Doom Mix IV Re... | TBA - Downtown LA, Los Angeles | 2020-04-10 | 2 |
| 9 | Rich Medina presents Home | Resident, Los Angeles | 2020-04-10 | 1 |


## Write a Function to Retrieve the URL for the Next Page

In [9]:
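def next_page(url):
    """Return the absolute URL of the page that follows `url`."""
    soup = bs(requests.get(url).content, 'html.parser')
    # The link to the next page sits inside the <li id="liNext2"> element
    url_container = soup.find('li', id='liNext2')
    next_page_ref = url_container.find('a').attrs['href']
    # The href omits the domain, so prepend it to get a full URL
    next_page_url = "https://www.residentadvisor.net" + next_page_ref
    return next_page_url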
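
String concatenation works as long as the `href` is a path that starts with `/`. As a hedged alternative, the standard library's `urllib.parse.urljoin` resolves an `href` against the current page URL and also copes with relative or already-absolute links. The helper name `build_next_url` and the example path are illustrative assumptions, not taken from the site.

```python
from urllib.parse import urljoin


def build_next_url(current_url, href):
    """Resolve a scraped href against the page it was found on.

    Returns None when there is no href, which a caller can treat as
    "no further pages". (Hypothetical helper for illustration.)
    """
    if not href:
        return None
    return urljoin(current_url, href)


page = "https://www.residentadvisor.net/events"
# Hypothetical href value; the real link text on the site may differ.
print(build_next_url(page, "/events/example-next-page"))
print(build_next_url(page, None))
```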

## Scrape the Next 1000 Events for Your Area

Display the data sorted by the number of attendees. If there is a tie for the number attending, sort by event date.

In [16]:
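dfs = []
total_rows = 0
cur_url = "https://www.residentadvisor.net/events"
# This run caps collection at 150 events; raise the cap to 1000 for the full exercise
while total_rows < 150:
    df = scrape_event2(cur_url)
    dfs.append(df)
    total_rows += len(df)
    cur_url = next_page(cur_url)  # follow the link to the next page
# Per-page row indices are kept; pass ignore_index=True to renumber them
df = pd.concat(dfs)
df = df.iloc[:150]  # trim any overshoot from the last page
print(len(df))
df.head()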

150


| | Event_Name | Venue | Event_Date | Number_of_Attendees |
|---|---|---|---|---|
| 0 | Clinic with Sebastian Mullaert (Circle of Live) | The Sayers Club, Los Angeles | 2020-04-08 | 7 |
| 1 | CCVO Thursdays with Kerim Bey & Joe Dismal | Underground SF, San Francisco | 2020-04-09 | 1 |
| 2 | Postponed - Damian Lazarus (Crosstown Rebels) | Public Works, San Francisco | 2020-04-09 | 22 |
| 3 | [CANCELLED] DJ Koze & Floating Points | 1015 Folsom, San Francisco | 2020-04-10 | 37 |
| 4 | Cancelled | F8 1192 Folsom, San Francisco | 2020-04-10 | 22 |
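
The loop above stops with an exception if any request or parse step fails, and it has no exit if the site runs out of pages. The following is a minimal sketch of a more defensive loop, not the lab's reference solution. To keep it runnable offline, the page scraper and the next-page lookup are passed in as functions, and a small fake site stands in for the real one. All names here (`collect_rows`, `fake_site`, `max_pages`) are hypothetical.

```python
def collect_rows(start_url, scrape_page, get_next_url, limit=150, max_pages=50):
    """Follow next-page links until `limit` rows are collected.

    scrape_page(url) must return a list of rows; get_next_url(url) must
    return the next URL, or None when there are no more pages.
    """
    rows = []
    url = start_url
    pages = 0
    while url and len(rows) < limit and pages < max_pages:
        try:
            rows.extend(scrape_page(url))
            url = get_next_url(url)
        except Exception as err:  # broad on purpose in this sketch
            print(f"Stopping at {url}: {err}")
            break
        pages += 1
    return rows[:limit]


# Fake three-page site: each URL maps to (rows on the page, next URL).
fake_site = {
    "/p1": (["a", "b", "c"], "/p2"),
    "/p2": (["d", "e"], "/p3"),
    "/p3": (["f"], None),
}
scrape = lambda u: fake_site[u][0]
following = lambda u: fake_site[u][1]

first_four = collect_rows("/p1", scrape, following, limit=4)
everything = collect_rows("/p1", scrape, following, limit=100)
print(first_four, everything)
```

With the functions from this lab, one option would be to pass `lambda u: scrape_event2(u).to_dict("records")` as `scrape_page` and `next_page` as `get_next_url`, then build a DataFrame from the returned rows.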


In [17]:
df

| | Event_Name | Venue | Event_Date | Number_of_Attendees |
|---|---|---|---|---|
| 0 | Clinic with Sebastian Mullaert (Circle of Live) | The Sayers Club, Los Angeles | 2020-04-08 | 7 |
| 1 | CCVO Thursdays with Kerim Bey & Joe Dismal | Underground SF, San Francisco | 2020-04-09 | 1 |
| 2 | Postponed - Damian Lazarus (Crosstown Rebels) | Public Works, San Francisco | 2020-04-09 | 22 |
| 3 | [CANCELLED] DJ Koze & Floating Points | 1015 Folsom, San Francisco | 2020-04-10 | 37 |
| 4 | Cancelled | F8 1192 Folsom, San Francisco | 2020-04-10 | 22 |
| ... | ... | ... | ... | ... |
| 6 | Pink Block – Pride Saturday 2020 | The Great Northern, San Francisco | 2020-06-28 | 1 |
| 0 | Ayli 10 Year Kickoff: Joy Orbison B2B Ben UFO ... | Public Works, San Francisco | 2020-07-10 | 12 |
| 1 | [RESCHEDULED] Atish All Night | TBA - Los Angeles, Los Angeles | 2020-07-11 | 236 |
| 0 | Chemical Surf - U.S. Tour | Avalon Hollywood, Los Angeles | 2020-07-17 | 2 |
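
The task asks for the events sorted by number of attendees, with ties broken by event date. One possible sketch of that step follows. It assumes the most-attended events should come first and that tied events should appear earliest date first, since the task does not state a direction. The scraped counts are text, so they are converted to numbers before sorting. The sample rows are copied from the output above, and the helper name `sort_events` is hypothetical.

```python
import pandas as pd

# A few rows copied from the scraped output; counts are strings as scraped.
sample = pd.DataFrame({
    "Event_Name": [
        "Clinic with Sebastian Mullaert (Circle of Live)",
        "Postponed - Damian Lazarus (Crosstown Rebels)",
        "[CANCELLED] DJ Koze & Floating Points",
        "Cancelled",
    ],
    "Event_Date": ["2020-04-08", "2020-04-09", "2020-04-10", "2020-04-10"],
    "Number_of_Attendees": ["7", "22", "37", "22"],
})


def sort_events(df):
    """Sort by attendees (descending), then by date (ascending)."""
    out = df.copy()
    # Convert text to numbers; anything unparseable becomes 0.
    out["Number_of_Attendees"] = (
        pd.to_numeric(out["Number_of_Attendees"], errors="coerce")
        .fillna(0)
        .astype(int)
    )
    out["Event_Date"] = pd.to_datetime(out["Event_Date"])
    return out.sort_values(
        ["Number_of_Attendees", "Event_Date"], ascending=[False, True]
    ).reset_index(drop=True)


sorted_events = sort_events(sample)
print(sorted_events)
```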


## Summary 

Congratulations! In this lab, you successfully developed a pipeline to scrape a website for concert event information!