# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to practice those skills again on a full-fledged site!
In this lab, you'll scrape event listings from a music website: https://www.residentadvisor.net.

## Objectives

You will be able to:
* Create a full scraping pipeline that traverses many pages of a website, handles errors, and stores the data

## View the Website

For this lab, you'll be scraping the https://www.residentadvisor.net website. Start by navigating to the events page [here](https://www.residentadvisor.net/events) in your browser.

<img src="images/ra.png">

In [1]:
# Load the https://www.residentadvisor.net/events page in your browser.

## Open the Inspect Element Feature

Next, open your browser's Inspect Element feature to preview the underlying HTML of the page.

In [2]:
# Open the inspect element feature in your browser

## Write a Function to Scrape All of the Events on a Given Events Page

The function should return a pandas DataFrame with the columns Event_Name, Venue, Event_Date, and Number_of_Attendees.
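
As a warm-up for the function described above, here is a minimal sketch of turning event markup into rows with those four columns. The HTML string, the `text_of` helper, and `parse_event` are all hypothetical and only loosely modeled on the tags this lab looks for; the live site's markup may differ.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical markup (an assumption, not copied from the real site)
SAMPLE_HTML = """
<div id="event-listing">
  <article class="event-item">
    <time datetime="2020-01-31T00:00"></time>
    <h1 class="event-title"><a itemprop="url" href="/events/1">Sample Night</a>
      <span>at <a href="/club/1">Sample Club</a></span></h1>
    <p class="attending"><span>5</span> Attending</p>
  </article>
  <article class="event-item">
    <h1 class="event-title"><a itemprop="url" href="/events/2">No Details Party</a></h1>
  </article>
</div>
"""

def text_of(parent, *args, **kwargs):
    """Return the stripped text of the first matching tag, or None if absent."""
    tag = parent.find(*args, **kwargs) if parent is not None else None
    return tag.get_text(strip=True) if tag is not None else None

def parse_event(item):
    """Build one row dict; any field that is missing from the markup stays None."""
    time_tag = item.find('time')
    title = item.find('h1', class_='event-title')
    venue_box = title.find('span') if title is not None else None
    attending = item.find('p', class_='attending')
    return {
        'Event_Name': text_of(title, 'a', itemprop='url'),
        'Venue': text_of(venue_box, 'a'),
        'Event_Date': time_tag.get('datetime') if time_tag is not None else None,
        'Number_of_Attendees': text_of(attending, 'span'),
    }

cols = ['Event_Name', 'Venue', 'Event_Date', 'Number_of_Attendees']
soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
rows = [parse_event(item) for item in soup.find_all('article', class_='event-item')]
df = pd.DataFrame(rows, columns=cols)
print(df)
```

Collecting plain dicts first and building the DataFrame once at the end avoids growing the DataFrame row by row, and the second sample event shows that missing fields simply become empty values rather than raising an error.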

In [3]:
from bs4 import BeautifulSoup
import requests
import re
from word2number import w2n
import pandas as pd

In [4]:
def scrape_events(events_page_url):
    #Your code here
    cols = ["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]
    df = pd.DataFrame(columns=cols)
    #     df.columns = ["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]
    
    html_page = requests.get(events_page_url)
    soup = BeautifulSoup(html_page.content, 'html.parser')
    event_listing = soup.find('div', id="event-listing")
    #print(event_listing)
    event_items = event_listing.find_all('article', class_="event-item")
    #print(event_items)
    
    for event_item in event_items:
        #print(event_item, "\n")
        
        event_data = {
            'Event_Name': None
            , 'Venue': None
            , 'Event_Date': None
            , 'Number_of_Attendees': None
        }
        
        # date
        event_time = event_item.find('time')
        if event_time is not None: # get all essential info for this event
            s_datetime = event_time.attrs['datetime']
            #print(f"datetime: {s_datetime}")
            event_data['Event_Date'] = s_datetime
            
        # name and venue
        event_title = event_item.find('h1', class_='event-title')
        if event_title is not None:
            event_link = event_title.find('a', itemprop='url')
            if event_link is not None:
                s_name = event_link.string
                #print(f"eventname: {s_name}")
                event_data['Event_Name'] = s_name
            event_venue_container = event_title.find('span')
            if event_venue_container is not None:
                # Reset for each event so a missing venue never reuses the previous event's value
                s_venue = None
                event_venue_link = event_venue_container.find('a')
                if event_venue_link is not None:
                    s_venue = event_venue_link.string
                else:
                    event_venue_span = event_venue_container.find('span')
                    if event_venue_span is not None:
                        s_venue = event_venue_span.string
                #print(f"venue: {s_venue}")
                event_data['Venue'] = s_venue
                
        event_attendees_container = event_item.find('p', class_='attending')
        if event_attendees_container is not None:
            event_attendees_span = event_attendees_container.find('span')
            if event_attendees_span is not None:
                s_attending = event_attendees_span.string
                #print(f"attending: {s_attending}")
                event_data['Number_of_Attendees'] = s_attending
                
        # DataFrame.append was removed in pandas 2.0; add the row by label instead
        df.loc[len(df)] = [event_data[col] for col in cols]
                
        #print("\n\n")
            
    return df

In [5]:
scrape_events('https://www.residentadvisor.net/events')

| | Event_Name | Venue | Event_Date | Number_of_Attendees |
|---|---|---|---|---|
| 0 | The Aliens Garden with Leonardo Gonnelli | Barcode NJ | 2020-01-31T00:00 | 5.0 |
| 1 | Society presents: Chus & Ceballos | Society Lounge | 2020-02-01T00:00 | 6.0 |
| 2 | Eleusinian Planes: Beckoning the Lesser Mysteries | TBA - New Jersey | 2020-02-01T00:00 | 3.0 |
| 3 | 1st Saturday's Club Classics | Il Portico Ristorante | 2020-02-01T00:00 | |


## Write a Function to Retrieve the URL for the Next Page

In [6]:
def next_page(url):
    #Your code here
    next_page_url = None
    
    html_page = requests.get(url)
    soup = BeautifulSoup(html_page.content, 'html.parser')
    
    page_items = soup.find('div', class_="page-items")
    #print(page_items)
    
    next_page_link = page_items.find('a', attrs={'ga-event-action': 'Next '})
    # Leave next_page_url as None when the page has no "Next" link
    if next_page_link is not None:
        next_page_url = next_page_link.attrs['href']
    
    return next_page_url

In [7]:
base_url = 'https://www.residentadvisor.net/events'
print(f"next page url for {base_url}: {next_page(base_url)}")

next page url for https://www.residentadvisor.net/events: /events/us/newjersey/week/2020-02-03
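
Note that the value printed above is a relative path, so it cannot be passed straight to `requests.get`. One way to turn it into a full URL is `urllib.parse.urljoin` from the standard library; this is a small sketch using the two strings from the output above rather than a live request.

```python
from urllib.parse import urljoin

base_url = 'https://www.residentadvisor.net/events'
relative_next = '/events/us/newjersey/week/2020-02-03'  # value printed above

# urljoin resolves the relative path against the scheme and host of base_url
absolute_next = urljoin(base_url, relative_next)
print(absolute_next)
# https://www.residentadvisor.net/events/us/newjersey/week/2020-02-03
```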


## Scrape the Next 1000 Events for Your Area

Display the data sorted by the number of attendees. If there is a tie for the number attending, sort by event date.
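
One possible shape for this step is a loop that scrapes a page, follows the next-page link, and stops once enough events have been collected. The sketch below is an assumption about how you might structure it, not the lab's reference solution: `collect_events`, `max_pages`, and the fake site are hypothetical, and the scraping functions are passed in as arguments so the loop can be tried without network access.

```python
from urllib.parse import urljoin

def collect_events(start_url, scrape_page, find_next, max_events=1000, max_pages=200):
    """Follow next-page links until max_events rows are collected.

    scrape_page(url) returns a list of row dicts for that page.
    find_next(url) returns the next page's URL (absolute or relative) or None.
    """
    rows, url, seen = [], start_url, set()
    while (url is not None and url not in seen
           and len(rows) < max_events and len(seen) < max_pages):
        seen.add(url)  # remember visited pages so a looping link cannot trap us
        try:
            rows.extend(scrape_page(url))
            next_url = find_next(url)
        except Exception as exc:  # broad on purpose: keep what was collected so far
            print(f"stopping at {url}: {exc}")
            break
        url = urljoin(url, next_url) if next_url else None
    return rows[:max_events]

# Hypothetical stand-in for a three-page site: url -> (event names, next link)
fake_site = {
    'https://example.com/events': (['a', 'b'], '/events/page2'),
    'https://example.com/events/page2': (['c'], '/events/page3'),
    'https://example.com/events/page3': (['d', 'e'], None),
}
rows = collect_events(
    'https://example.com/events',
    scrape_page=lambda u: [{'Event_Name': name} for name in fake_site[u][0]],
    find_next=lambda u: fake_site[u][1],
    max_events=4,
)
print([row['Event_Name'] for row in rows])  # ['a', 'b', 'c', 'd']
```

With the functions from this lab, `scrape_page` could plausibly be `lambda u: scrape_events(u).to_dict('records')` and `find_next` could be `next_page`. Adding a short `time.sleep` between requests is also worth considering so the crawl does not hammer the site.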

In [8]:
#Your code here
import urllib.parse

def events_by_area(area):
    cols = ["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]
    df = pd.DataFrame(columns=cols)
    #     df.columns = ["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]
    
    area = urllib.parse.quote(area)
    url = f"https://www.residentadvisor.net/search.aspx?searchstr={area}&section=events&titles=0"
    parsed_url = urllib.parse.urlparse(url)
    #print(parsed_url, "\n\n")
    html_page = requests.get(url)
    soup = BeautifulSoup(html_page.content, 'html.parser')
    main = soup.find('main')
    events_container = main.find('div', class_="events")
    events_list_container = events_container.find('ul', class_='list')
    event_items = events_list_container.find_all('li')
    #print(event_items)
    
    for idx, event_item in enumerate(event_items):
        #print(event_item, "\n")
        
        event_data = {
            'Event_Name': None
            , 'Venue': None
            , 'Event_Date': None
            , 'Number_of_Attendees': None
        }
            
        # name
        event_link = event_item.find('a')
        s_name = event_link.string
        #print(f"eventname: {s_name}")
        event_data['Event_Name'] = s_name
        
        #venue
        s_event_url = 'https://' + parsed_url.netloc + event_link['href']
        #print(f"even link: {s_event_url}")
        event_html_page = requests.get(s_event_url)
        event_soup = BeautifulSoup(event_html_page.content, 'html.parser')
        event_section = event_soup.find('section', class_='contentDetail')
        #print(event_section)
        event_detail = event_section.find('aside', id='detail')
        event_details_container = event_detail.find('ul', class_='clearfix')
        event_details = event_details_container.find_all('li')
        event_venue = event_details[1]
        #print(event_venue)
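        event_venue_link = event_venue.find('a')
        # Some venues are plain text rather than a link, so guard against a missing <a> tag
        s_venue = event_venue_link.string if event_venue_link is not None else None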
        if s_venue is None:
            s_venue = event_venue.contents[1]
        #print(f"venue: {s_venue}")
        event_data['Venue'] = s_venue
        
        attending = event_section.find('h1', id='MembersFavouriteCount')
        s_attending = attending.string.strip()
        #print(f"attending: {s_attending}")
        event_data['Number_of_Attendees'] = int(s_attending)
        
        # date
        event_time = event_item.find('span')
        s_datetime = event_time.string
        prefix = 'on '
        if s_datetime.startswith(prefix):
            s_datetime = s_datetime[len(prefix):]
        #print(f"datetime: {s_datetime}")
        event_data['Event_Date'] = s_datetime

        #print("\n")
        
        # DataFrame.append was removed in pandas 2.0; add the row by label instead
        df.loc[len(df)] = [event_data[col] for col in cols]
            
    return df

In [9]:
df = events_by_area('California').sort_values(by='Number_of_Attendees', ascending=False)
df

| | Event_Name | Venue | Event_Date | Number_of_Attendees |
|---|---|---|---|---|
| 110 | Trikk Curates: Trikk 4 Hour Set and Solar | Mick's Garage | 23 Nov 2019 | 384 |
| 26 | 𝐁𝐄𝐍 𝐒𝐈𝐌𝐒 ⑊ 𝐓𝐑𝐔𝐍𝐂𝐀𝐓𝐄 ⑊ 𝐃𝐄𝐒𝐍𝐀 | Secret Location - Brooklyn | 08 Feb 2020 | 193 |
| 58 | hothouse | Warehouse TBA | 11 Jan 2020 | 125 |
| 140 | Oscar G - The Birthday Event | Elsewhere | 08 Nov 2019 | 122 |
| 13 | Dirty Epic and DTE present: Paula Temple SF Debut | F8 1192 Folsom | 28 Feb 2020 | 110 |
| ... | ... | ... | ... | ... |
| 19 | Draag Residency with Dustin Wong & Brin, Night... | The Echoplex | 17 Feb 2020 | 0 |
| 33 | Energi Wednesdays: Zia & Sippy (18 ) | Sky SLC | 05 Feb 2020 | 0 |
| 28 | Eric Dlux | E11EVEN MIAMI & ROOFTOP | 07 Feb 2020 | 0 |
| 29 | Monte Carlo, DJ Mr Grant, DJ Walksalot | 111 Minna Gallery | 07 Feb 2020 | 0 |
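
The sort above orders by attendance only, so events with the same count are not ordered by date. A sketch of the tie-break the task asks for is shown here on a few made-up rows. The event names, the `Event_Date_Parsed` column, and the choice of earliest-date-first are assumptions. Dates such as `23 Nov 2019` are parsed first because sorting them as plain strings would not be chronological.

```python
import pandas as pd

# Hypothetical rows in the same shape that events_by_area returns
events = pd.DataFrame({
    'Event_Name': ['A', 'B', 'C', 'D'],
    'Event_Date': ['08 Feb 2020', '23 Nov 2019', '11 Jan 2020', '05 Feb 2020'],
    'Number_of_Attendees': [10, 384, 10, 0],
})

# Parse the "DD Mon YYYY" strings into real timestamps
events['Event_Date_Parsed'] = pd.to_datetime(events['Event_Date'], format='%d %b %Y')

# Most attendees first; among ties, the earlier date first
sorted_events = events.sort_values(
    by=['Number_of_Attendees', 'Event_Date_Parsed'],
    ascending=[False, True],
)
print(sorted_events[['Event_Name', 'Event_Date', 'Number_of_Attendees']])
```

Here events A and C tie at 10 attendees, and C is listed first because 11 Jan 2020 comes before 08 Feb 2020.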


## Summary 

Congratulations! In this lab, you successfully developed a pipeline to scrape a website for concert event information!