# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to practice those skills on a full-fledged site!
In this lab, you'll scrape event listings from a music website: https://www.residentadvisor.net.

## Objectives

You will be able to:
* Create a full scraping pipeline that involves traversing over many pages of a website, dealing with errors and storing data

## View the Website

For this lab, you'll be scraping the https://www.residentadvisor.net website. Start by navigating to the events page [here](https://www.residentadvisor.net/events) in your browser.

<img src="images/ra.png">

In [None]:
# Load the https://www.residentadvisor.net/events page in your browser.
from bs4 import BeautifulSoup
import requests
import datetime
import pandas as pd

html_page = requests.get('https://www.residentadvisor.net/events')
soup = BeautifulSoup(html_page.content, 'html.parser')
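
A bare `requests.get` call does not report failures such as a bad URL, a timeout or an HTTP error status. The sketch below shows one possible way to guard the request. `fetch_html` and its `timeout` default are hypothetical names chosen for illustration, not part of the lab's solution.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as bytes, or None if the request fails.

    Hypothetical helper for illustration only.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Raise an HTTPError for 4xx/5xx responses instead of parsing an error page
        response.raise_for_status()
    except requests.exceptions.RequestException as err:
        print(f"Request for {url!r} failed: {err}")
        return None
    return response.content


# A URL without a scheme is rejected before any network traffic is sent
result = fetch_html("not-a-valid-url")
print(result)
```

When the helper returns `None`, the caller can skip the page or retry instead of passing an error page to `BeautifulSoup`.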

## Open the Inspect Element Feature

Next, open the inspect element feature from your web browser in order to preview the underlying HTML associated with the page.

In [None]:
# Open the inspect element feature in your browser
# print(soup.prettify())  # Uncomment to print the parsed HTML with indentation

## Write a Function to Scrape All of the Events on the Given Events Page

The function should return a Pandas DataFrame with columns for the Event_Name, Venue, Event_Date and Number_of_Attendees.

In [None]:
events_column = soup.find('div', class_="fl col4")
event_listings = events_column.find_all('article')

In [None]:
events = events_column.find_all('h1')

In [None]:
# Retrieve event names
event_names = [e.find('a').text for e in events]
len(event_names)

In [None]:
# Retrieve Venue
venues = [e.find('span').text[3:] for e in events]
len(venues)

In [None]:
# Retrieve Dates
event_dates = [e.text[:10] for e in events_column.find_all('time')]
len(event_dates)

In [None]:
# Retrieve number of attendees
attendees = [int(e.find('span').text) for e in events_column.find_all('p', class_='attending')]
len(attendees)

In [None]:
event_names = []
venues = []
event_dates = []
attendees = []

for event in event_listings:
    event_names.append(event.find('h1').find('a').text)
    venues.append(event.find('h1').find('span').text[3:])
    event_dates.append(event.find('time').text[:10])
    
    # Check if event has been cancelled
    if event.find('p', class_='attending') is not None:
        attendees.append(int(event.find('p', class_='attending').find('span').text))
    else:
        attendees.append(0)

print(len(event_names), len(venues), len(event_dates), len(attendees))
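
The extraction logic can be tried offline on a small HTML string. The markup below is made up to mimic the structure that the selectors assume (a `div` with class `fl col4`, one `article` per event, a venue `span` whose text starts with `at `). The real page may differ, and `parse_events` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure the selectors assume
sample_html = """
<div class="fl col4">
  <article>
    <h1><a href="#">Warehouse Night</a> <span>at Club Example</span></h1>
    <time>2020-03-14T00:00</time>
    <p class="attending"><span>42</span> Attending</p>
  </article>
  <article>
    <h1><a href="#">Cancelled Party</a> <span>at Venue Two</span></h1>
    <time>2020-03-15T00:00</time>
  </article>
</div>
"""


def parse_events(html):
    """Return one dict per <article>, using 0 when no attendance count exists."""
    soup = BeautifulSoup(html, 'html.parser')
    events_column = soup.find('div', class_='fl col4')
    if events_column is None:
        return []
    rows = []
    for event in events_column.find_all('article'):
        heading = event.find('h1')
        attending = event.find('p', class_='attending')
        rows.append({
            'Event_Name': heading.find('a').text,
            'Venue': heading.find('span').text[3:],  # drop the leading "at "
            'Event_Date': event.find('time').text[:10],
            'Number_of_Attendees': (
                int(attending.find('span').text) if attending is not None else 0
            ),
        })
    return rows


rows = parse_events(sample_html)
for row in rows:
    print(row)
```

Building one record per `article` keeps the four fields aligned even when an event has no attendance element.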

In [None]:
def scrape_events(num_events, events_page_url, names=None, venues=None, dts=None, attendees=None):
    # Create fresh lists when none are passed in; list defaults would be shared across calls
    names = [] if names is None else names
    venues = [] if venues is None else venues
    dts = [] if dts is None else dts
    attendees = [] if attendees is None else attendees

    # Load the events page
    html_page = requests.get(events_page_url)
    soup = BeautifulSoup(html_page.content, 'html.parser')

    # Retrieve the event listings
    events_column = soup.find('div', class_="fl col4")
    event_listings = events_column.find_all('article')

    # Loop through each event and retrieve name, venue, date and number of attendees
    for event in event_listings:
        # Stop once the requested number of events has been collected
        if len(names) >= num_events:
            break
        names.append(event.find('h1').find('a').text)
        venues.append(event.find('h1').find('span').text[3:])
        dts.append(event.find('time').text[:10])

        # Events without an attendance count (e.g. cancelled events) are recorded as 0
        attending = event.find('p', class_='attending')
        attendees.append(int(attending.find('span').text) if attending is not None else 0)

    return names, venues, dts, attendees
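
A note on list defaults such as `names=[]`: Python evaluates a default value once, when the function is defined, so every call that omits the argument appends to the same list. For a function that accumulates results into its parameters, this means either passing explicit lists or using `None` as the default. The two toy functions below (hypothetical names) illustrate the difference.

```python
def collect_shared(item, bucket=[]):
    # The default list is created once, when the function is defined
    bucket.append(item)
    return bucket


def collect_fresh(item, bucket=None):
    # A new list is created on every call that omits `bucket`
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


first = collect_shared("a")
second = collect_shared("b")
print(second)           # ['a', 'b']
print(first is second)  # True: both names refer to the shared default list

fresh_one = collect_fresh("a")
fresh_two = collect_fresh("b")
print(fresh_one, fresh_two)  # ['a'] ['b']
```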

In [None]:
# names = []
# venues = []
# dts = []
# attendees = []

# test = scrape_events(20, 'https://www.residentadvisor.net/events', names, venues, dts, attendees)
# print(len(names))
# display(test)

## Write a Function to Retrieve the URL for the Next Page

In [None]:
from urllib.parse import urljoin

def next_page(url):

    html_page = requests.get(url)
    soup = BeautifulSoup(html_page.content, 'html.parser')

    # Retrieve the href for the next arrow and resolve it against the site root
    url_ext = soup.find('li', class_='but arrow-right right').find('a').attrs['href']
    next_page_url = urljoin('https://www.residentadvisor.net/', url_ext)

    return next_page_url
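
When an `href` is combined with a base URL, plain string concatenation produces a doubled slash if the `href` already starts with `/`. The standard library's `urllib.parse.urljoin` handles both cases. The paths below are hypothetical examples, not values confirmed from the site.

```python
from urllib.parse import urljoin

base_url = 'https://www.residentadvisor.net/'
# Hypothetical href values; the real site's paths may differ
absolute_path = '/events/us/newyork/week/2020-03-08'
relative_path = 'events/us/newyork/week/2020-03-08'

concatenated = base_url + absolute_path
joined_absolute = urljoin(base_url, absolute_path)
joined_relative = urljoin(base_url, relative_path)

print(concatenated)     # note the '//' after the domain
print(joined_absolute)
print(joined_relative)
```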

## Scrape the Next 1000 Events for Your Area

Display the data sorted by the number of attendees. If there is a tie for the number attending, sort by event date.

In [None]:
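def scrape_x_events(num_events, url, names=None, venues=None, dts=None, attendees=None):
    # Create fresh lists when none are passed in; list defaults would be shared across calls
    names = [] if names is None else names
    venues = [] if venues is None else venues
    dts = [] if dts is None else dts
    attendees = [] if attendees is None else attendees

    # Scrape the current page, then follow the next-page link until enough events are collected
    while len(names) < num_events:
        scrape_events(num_events, url, names, venues, dts, attendees)
        url = next_page(url)

    df = pd.DataFrame({
        "Event_Name": names,
        "Venue": venues,
        "Event_Date": dts,
        "Number_of_Attendees": attendees,
    })

    return df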
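
Pagination of this kind keeps requesting pages until `num_events` is reached, so it does not terminate cleanly if the site runs out of events first. One possible safeguard is a cap on the number of pages visited. The sketch below uses stand-in functions and a fake three-page site so that it runs offline. `collect_events`, `max_pages`, `fake_scrape` and `fake_next` are all hypothetical names.

```python
def collect_events(num_events, start_url, scrape_page, get_next_url, max_pages=50):
    """Gather up to `num_events` items, visiting at most `max_pages` pages."""
    collected = []
    url = start_url
    pages_visited = 0
    while len(collected) < num_events and url is not None and pages_visited < max_pages:
        collected.extend(scrape_page(url))
        url = get_next_url(url)  # None signals that there is no next page
        pages_visited += 1
    return collected[:num_events], pages_visited


# Stand-ins for the real scraping functions: three pages with two events each
fake_site = {
    'page1': (['a', 'b'], 'page2'),
    'page2': (['c', 'd'], 'page3'),
    'page3': (['e', 'f'], None),
}


def fake_scrape(url):
    return fake_site[url][0]


def fake_next(url):
    return fake_site[url][1]


events, pages = collect_events(5, 'page1', fake_scrape, fake_next)
print(events, pages)

# Asking for more events than exist stops at the last page instead of looping forever
all_events, all_pages = collect_events(100, 'page1', fake_scrape, fake_next)
print(all_events, all_pages)
```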

In [None]:
url = 'https://www.residentadvisor.net/events'
# Reduced to 200 events for now because the calendar has few events due to COVID-19
num_events = 200
names = []
venues = []
dts = []
attendees = []
num_pages = 0

event_table = scrape_x_events(num_events, url, names, venues, dts, attendees)

In [None]:
event_table = event_table.sort_values(by=['Number_of_Attendees', 'Event_Date'], ascending=[False, True])
event_table.reset_index(drop=True, inplace=True)
event_table
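
The two-key sort can be checked on a small, made-up table. This sketch assumes the dates are ISO-formatted strings (`YYYY-MM-DD`) and parses them with `pd.to_datetime` so that ties are broken chronologically. The ten-character slice used when scraping may yield a different format on the real site, in which case the parsing step would need adjusting.

```python
import pandas as pd

# Made-up rows for illustration only
df = pd.DataFrame({
    'Event_Name': ['A', 'B', 'C', 'D'],
    'Event_Date': ['2020-03-20', '2020-03-14', '2020-03-15', '2020-03-10'],
    'Number_of_Attendees': [10, 25, 25, 3],
})

# Parse the dates so ties are ordered chronologically rather than as plain text
df['Event_Date'] = pd.to_datetime(df['Event_Date'])

# Most-attended events first; equal counts fall back to the earlier date
ranked = df.sort_values(
    by=['Number_of_Attendees', 'Event_Date'],
    ascending=[False, True],
).reset_index(drop=True)

print(ranked)
```

Events B and C tie on attendance, so B comes first because its date is earlier.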

## Summary 

Congratulations! In this lab, you successfully developed a pipeline to scrape a website for concert event information!