# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to again practice those skills on a full-fledged site!
In this lab, you'll practice your scraping skills on a music website: https://www.residentadvisor.net.
## Objectives

You will be able to:
* Create a full scraping pipeline that involves traversing over many pages of a website, dealing with errors and storing data

## View the Website

For this lab, you'll be scraping the https://www.residentadvisor.net website. Start by navigating to the events page [here](https://www.residentadvisor.net/events) in your browser.

<img src="images/ra.png">

In [None]:
# Load the https://www.residentadvisor.net/events page in your browser.

## Open the Inspect Element Feature

Next, open the inspect element feature from your web browser in order to preview the underlying HTML associated with the page.

In [2]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time

In [3]:
html_page = requests.get('https://www.residentadvisor.net/events/us/newyork')
soup = BeautifulSoup(html_page.content, 'html.parser')

In [31]:
content = soup.find('div', class_= 'strip slide small')
#content 

In [80]:
event_name = content.findAll('h1')
event_names_list = []
for event in event_name:
    event_names_list.append(event.text.replace('[RESCHEDULED]', '').strip())

event_names_list


['KTPM presents / Tribal Disco',
 'Rave Revue: Living Your Best Afterlife',
 'Sasha & John Digweed at The Brooklyn Mirage',
 'Nora En Pure presents Purified',
 'Rjd2',
 'Jellybean Rocks The House ~ The Boat Ride with Jellybean Benitez',
 'Amon Tobin presents: Two Fingers',
 'Autograf, Zack Martino']

In [79]:
event_venue = content.findAll('p', class_= 'copy nohide')
venues_list = []
for venue in event_venue:
    venues_list.append(venue.text.strip())

venues_list

['Rebel Cafe & Garden',
 'Secret Loft',
 'Brooklyn Mirage',
 'Brooklyn Mirage',
 'Elsewhere',
 'Circle Line Cruises',
 'Elsewhere',
 'Elsewhere']

In [100]:
event_dates = content.findAll('article', class_= 'highlight-top')
event_date_list = []
for date in event_dates:
    event_date_list.append(date.text.split('\n')[1])

event_date_list

['Sat, 25 Jul 2020',
 'Sat, 25 Jul 2020',
 'Fri, 31 Jul 2020',
 'Sat, 1 Aug 2020',
 'Fri, 14 Aug 2020',
 'Sat, 15 Aug 2020',
 'Thu, 20 Aug 2020',
 'Sat, 22 Aug 2020']

In [78]:
attendees = content.findAll('p', class_= 'counter nohide')
attendees_list = []
for attendee in attendees:
    attendees_list.append(attendee.text.strip())

attendees_list

['7 attending',
 '2 attending',
 '56 attending',
 '71 attending',
 '13 attending',
 '7 attending',
 '20 attending',
 '24 attending']

In [104]:
event_information = {'event_name': event_names_list, 
                      'venue': venues_list, 
                      'date' : event_date_list,
                      'attendees' : attendees_list
                    }

df = pd.DataFrame(event_information)
df.columns = ["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]
df.head()

Unnamed: 0,Event_Name,Venue,Event_Date,Number_of_Attendees
0,KTPM presents / Tribal Disco,Rebel Cafe & Garden,"Sat, 25 Jul 2020",7 attending
1,Rave Revue: Living Your Best Afterlife,Secret Loft,"Sat, 25 Jul 2020",2 attending
2,Sasha & John Digweed at The Brooklyn Mirage,Brooklyn Mirage,"Fri, 31 Jul 2020",56 attending
3,Nora En Pure presents Purified,Brooklyn Mirage,"Sat, 1 Aug 2020",71 attending
4,Rjd2,Elsewhere,"Fri, 14 Aug 2020",13 attending


## Write a Function to Scrape all of the Events on the Given Page Events Page

The function should return a Pandas DataFrame with columns for the Event_Name, Venue, Event_Date and Number_of_Attendees.

In [107]:
def scrape_events(events_page_url):
    html_page = requests.get(events_page_url)
    soup = BeautifulSoup(html_page.content, 'html.parser')
    
    event_name = content.findAll('h1')
    event_names_list = []
    for event in event_name:
        event_names_list.append(event.text.replace('[RESCHEDULED]', '').strip())
    
    event_venue = content.findAll('p', class_= 'copy nohide')
    venues_list = []
    for venue in event_venue:
        venues_list.append(venue.text.strip())
    
    event_dates = content.findAll('article', class_= 'highlight-top')
    event_date_list = []
    for date in event_dates:
        event_date_list.append(date.text.split('\n')[1])
    
    attendees = content.findAll('p', class_= 'counter nohide')
    attendees_list = []
    for attendee in attendees:
        attendees_list.append(attendee.text.strip())
    
    event_information = {'event_name': event_names_list, 'venue': venues_list, 
                         'date' : event_date_list, 'attendees' : attendees_list}
     
    df = pd.DataFrame(event_information)
    df.columns = ["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]
    
    return df

In [108]:
scrape_events('https://www.residentadvisor.net/events/us/newyork')

Unnamed: 0,Event_Name,Venue,Event_Date,Number_of_Attendees
0,KTPM presents / Tribal Disco,Rebel Cafe & Garden,"Sat, 25 Jul 2020",7 attending
1,Rave Revue: Living Your Best Afterlife,Secret Loft,"Sat, 25 Jul 2020",2 attending
2,Sasha & John Digweed at The Brooklyn Mirage,Brooklyn Mirage,"Fri, 31 Jul 2020",56 attending
3,Nora En Pure presents Purified,Brooklyn Mirage,"Sat, 1 Aug 2020",71 attending
4,Rjd2,Elsewhere,"Fri, 14 Aug 2020",13 attending
5,Jellybean Rocks The House ~ The Boat Ride with...,Circle Line Cruises,"Sat, 15 Aug 2020",7 attending
6,Amon Tobin presents: Two Fingers,Elsewhere,"Thu, 20 Aug 2020",20 attending
7,"Autograf, Zack Martino",Elsewhere,"Sat, 22 Aug 2020",24 attending


## Write a Function to Retrieve the URL for the Next Page

In [115]:
html_page = requests.get('https://www.residentadvisor.net/events/us/newyork')
soup = BeautifulSoup(html_page.content, 'html.parser')

In [121]:
button_location = soup.find('a', attrs={'ga-event-action':"Next "}).attrs['href']
button_location

'/events/us/newyork/week/2020-08-01'

In [126]:
def next_page(url):
    html_page = requests.get(url)
    soup = BeautifulSoup(html_page.content, 'html.parser')
    button_url = soup.find('a', attrs={'ga-event-action':"Next "}).attrs['href']   
    next_page_url = 'https://www.residentadvisor.net{}'.format(button_url)
    return next_page_url

In [123]:
next_page('https://www.residentadvisor.net/events/us/newyork')

'https://www.residentadvisor.net/events/us/newyork/week/2020-08-01'

## Scrape the Next 1000 Events for Your Area

Display the data sorted by the number of attendees. If there is a tie for the number attending, sort by event date.

In [128]:
#My code worked with a lesser number.  My thought is that I pulled the information from the wrong event listing area
#...which is what we did during study group.  I'll redo this the right way at some point.

num_events = 0
url = 'https://www.residentadvisor.net/events/us/newyork'

while num_events < 10:
    df = pd.DataFrame(scrape_events(url))
    url = next_page(url)
    num_events += 1
    time.sleep(3)
    
df


Unnamed: 0,Event_Name,Venue,Event_Date,Number_of_Attendees
0,KTPM presents / Tribal Disco,Rebel Cafe & Garden,"Sat, 25 Jul 2020",7 attending
1,Rave Revue: Living Your Best Afterlife,Secret Loft,"Sat, 25 Jul 2020",2 attending
2,Sasha & John Digweed at The Brooklyn Mirage,Brooklyn Mirage,"Fri, 31 Jul 2020",56 attending
3,Nora En Pure presents Purified,Brooklyn Mirage,"Sat, 1 Aug 2020",71 attending
4,Rjd2,Elsewhere,"Fri, 14 Aug 2020",13 attending
5,Jellybean Rocks The House ~ The Boat Ride with...,Circle Line Cruises,"Sat, 15 Aug 2020",7 attending
6,Amon Tobin presents: Two Fingers,Elsewhere,"Thu, 20 Aug 2020",20 attending
7,"Autograf, Zack Martino",Elsewhere,"Sat, 22 Aug 2020",24 attending


## Summary 

Congratulations! In this lab, you successfully developed a pipeline to scrape a website for concert event information!