# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to practice those skills on a full-fledged site!
In this lab, you'll practice scraping a music website: https://www.residentadvisor.net.

## Objectives

You will be able to:
* Create a full scraping pipeline that traverses many pages of a website, handles errors, and stores the data

## View the Website

For this lab, you'll be scraping the https://www.residentadvisor.net website. Start by navigating to the events page [here](https://www.residentadvisor.net/events) in your browser.

<img src="images/ra.png">

In [1]:
# Load the https://www.residentadvisor.net/events page in your browser.

## Open the Inspect Element Feature

Next, open the inspect element feature from your web browser in order to preview the underlying HTML associated with the page.

In [2]:
# Open the inspect element feature in your browser

## Write a Function to Scrape All of the Events on the Given Events Page

The function should return a Pandas DataFrame with columns for the Event_Name, Venue, Event_Date and Number_of_Attendees.

### Note on the Website

Unfortunately, residentadvisor.net does not work for this exercise. The site seems to have been drastically altered since the screenshot above was taken, and the HTML that BeautifulSoup parses contains alerts and errors, apparently because security measures detect the scrape and keep it from working properly.

After checking many other sites for a similar layout and functionality that would also allow scraping, I found concerts50.com, which works pretty well. It has a list of events that can be navigated with page buttons.

Since that site has no data for Number_of_Attendees, I scraped the city where each event takes place instead.

In [3]:
import re
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
# Exploration: fetch one page to design and test the parts of the scraping function
response = requests.get("https://concerts50.com/upcoming-concerts-in-new-york")
soup = BeautifulSoup(response.content, 'html.parser')

In [4]:
next_button = soup.find('a', string='>')
next_button['href']

'/upcoming-concerts-in-new-york/2'

In [5]:
events = soup.find('tbody').findAll('tr')

In [6]:
print(events[0].prettify())

<tr class="row event-item" itemscope="" itemtype="http://schema.org/Event">
 <td class="d-inline-block col-3 col-md-2 text-left date">
  <meta content="2021-03-31" itemprop="startDate"/>
  <span itemprop="location" itemscope="" itemtype="http://schema.org/Place">
   <span content="Daryl's House" itemprop="name">
   </span>
   <span itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
    <abbr content="Pawling" itemprop="addressLocality">
    </abbr>
    <abbr content="NY" itemprop="addressRegion">
    </abbr>
    <abbr content="US" itemprop="addressCountry">
    </abbr>
   </span>
  </span>
  <span class="d-inline-block">
   Wed,
  </span>
  <span class="event-time">
   6:00 PM
  </span>
  <br/>
  <b class="d-inline-block">
   Mar 31
  </b>
  <br/>
  <span class="d-inline-block text-left">
   2021
  </span>
 </td>
 <td class="col-2 col-md-1 col-lg-1 image">
  <img alt="Chris Raabe" src="/uploads/artist/chris-raabe/xs/image.jpg" style="display: inline-block; bord

In [7]:
'disabled' in soup.find('ul', id='yw0').findAll('li')[6]['class']

False

In [8]:
soup.find('ul', id='yw0').findAll('li')[6].find('a')['href']

'/upcoming-concerts-in-new-york/2'

In [15]:
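def get_soup(url):
    """Fetch a page and return its parsed soup and the list of event rows."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early on HTTP errors such as 403 or 404
    soup = BeautifulSoup(response.content, 'html.parser')
    events = soup.find('tbody').find_all('tr')
    return soup, events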
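
Since one of the objectives is dealing with errors, here is a minimal sketch of a retry helper that could wrap a fetch function such as `get_soup`. The name `with_retries` and its parameters are hypothetical and not part of the original lab. It uses only the standard library, so the exceptions to retry on are passed in by the caller.

```python
import time


def with_retries(fetch, url, attempts=3, delay=1.0, exceptions=(Exception,)):
    """Call fetch(url), retrying on the given exceptions up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except exceptions as error:
            last_error = error
            if attempt < attempts:
                # Wait a little longer after each failed attempt
                time.sleep(delay * attempt)
    # Every attempt failed, so surface the last error to the caller
    raise last_error
```

With the `requests` library, a call might look like `with_retries(get_soup, url, exceptions=(requests.exceptions.RequestException,))`. Note that a blocked or redesigned page can also make `soup.find('tbody')` return `None`, which surfaces as an `AttributeError` rather than a network error.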

In [19]:
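def parse_event(event):
    """Extract the name, venue, date and city from one event table row."""
    name = event.find('p', itemprop='name').find('a').text
    venue = event.find('span', itemprop='name')['content']
    # Select the start date by its itemprop rather than relying on it being the first <meta>
    date = event.find('meta', itemprop='startDate')['content']
    city = event.find('abbr', itemprop="addressLocality")['content']
    return {"Event_Name": name, "Venue": venue, "Event_Date": date, "City": city}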

In [17]:
def scrape_events(events_page_url):
    """Scrape every page of events, starting at events_page_url, into a DataFrame."""
    rows = []
    url = events_page_url

    # Visit each page in turn until the "next" button is disabled
    while url is not None:
        soup, events = get_soup(url)
        rows.extend(parse_event(event) for event in events)

        # The seventh <li> of the pager is the "next" button; it is disabled on the last page
        next_item = soup.find('ul', id='yw0').find_all('li')[6]
        if 'disabled' in next_item['class']:
            url = None
        else:
            url = "https://concerts50.com" + next_item.find('a')['href']
            time.sleep(1)  # pause briefly between requests

    return pd.DataFrame(rows, columns=["Event_Name", "Venue", "Event_Date", "City"])
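
The pagination logic can also be separated from the site-specific parsing. The sketch below is one possible way to do that, not the lab's reference solution: `crawl_pages` and `fetch_page` are hypothetical names, and `fetch_page(url)` is assumed to return a tuple of the rows found on that page and the absolute URL of the next page (`None` on the last page). A page cap and a set of visited URLs guard against endless loops.

```python
import time


def crawl_pages(start_url, fetch_page, max_pages=100, delay=1.0):
    """Collect rows from start_url and every page that follows it.

    fetch_page(url) is a hypothetical callable returning (rows, next_url),
    where next_url is None on the last page.
    """
    rows = []
    seen = set()
    url = start_url
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page_rows, next_url = fetch_page(url)
        rows.extend(page_rows)
        url = next_url
        if url is not None:
            time.sleep(delay)  # pause between requests to go easy on the server
    return rows
```

Because the fetching function is passed in, the loop can be tried out on fake pages before it is pointed at a real website.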

In [20]:
events_df = scrape_events("https://concerts50.com/upcoming-concerts-in-new-york")

In [22]:
events_df.info()
display(events_df.describe())
events_df.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 786 entries, 0 to 785
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Event_Name  786 non-null    object
 1   Venue       786 non-null    object
 2   Event_Date  786 non-null    object
 3   City        786 non-null    object
dtypes: object(4)
memory usage: 24.7+ KB

| | Event_Name | Venue | Event_Date | City |
|---|---|---|---|---|
| count | 786 | 786 | 786 | 786 |
| unique | 651 | 146 | 295 | 56 |
| top | 2021 Season Tickets | Daryl's House | 2021-12-31 | New York |
| freq | 7 | 101 | 9 | 230 |


| | Event_Name | Venue | Event_Date | City |
|---|---|---|---|---|
| 0 | Chris Raabe | Daryl's House | 2021-03-31 | Pawling |
| 1 | Goodie Mob | Sony Hall | 2021-03-31 | New York |
| 2 | David Powers | Daryl's House | 2021-04-01 | Pawling |
| 3 | Miller & The Other Sinners - RISE Record Release | Tralf | 2021-04-01 | Buffalo |
| 4 | Shawn Colvin | Hart Theatre at the Egg | 2021-04-01 | Albany |
| 5 | TV Girl with Jordana Tickets (16+ Event, Resch... | Music Hall of Williamsburg | 2021-04-01 | Brooklyn |
| 6 | Lilly Hiatt & The Harmaleighs Tickets (21+ Eve... | Rough Trade NYC | 2021-04-01 | Brooklyn |
| 7 | Auguste and Alden | Daryl's House | 2021-04-02 | Pawling |
| 8 | The Yachtfathers | Tralf | 2021-04-02 | Buffalo |
| 9 | Southside Johnny and the Asbury Jukes Tickets ... | Center For The Arts Of Homer | 2021-04-02 | Homer |


## Write a Function to Retrieve the URL for the Next Page

In [13]:
# This is incorporated into the function above
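
For reference, the next-page step could also live in a small helper of its own. The sketch below is an assumption about how that might look: `next_page_url` is a hypothetical name, and it takes the `href` and the CSS classes of the pager's "next" item (the values inspected in the exploration cells) instead of a soup object, so it runs without any network access. `urljoin` from the standard library turns the relative `href` into an absolute URL.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href, css_classes=()):
    """Return the absolute URL of the next page, or None on the last page."""
    # A missing link or a disabled "next" button means there is no next page
    if not href or "disabled" in css_classes:
        return None
    return urljoin(current_url, href)


first_page = "https://concerts50.com/upcoming-concerts-in-new-york"
print(next_page_url(first_page, "/upcoming-concerts-in-new-york/2"))
```

With the soup from the exploration cells, the arguments would come from the seventh pager item: its `['class']` list and the `['href']` of the `<a>` inside it.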

## Scrape the Next 1000 Events for Your Area

Display the data sorted by the number of attendees. If there is a tie for the number attending, sort by event date.

In [14]:
# There are only(!) 786 listed on this site, and they were all scraped above. They are already sorted by date.
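
If attendee counts were available, the requested ordering could be sketched as below. The rows and their `Number_of_Attendees` values are invented purely for illustration, since concerts50.com does not provide them, and sorting the counts in descending order is an assumption because the task does not state a direction. Ties fall back to the event date, which sorts chronologically as an ISO-formatted string.

```python
# Made-up sample rows: the attendee counts are invented for illustration only
sample_rows = [
    {"Event_Name": "Event A", "Event_Date": "2021-04-02", "Number_of_Attendees": 40},
    {"Event_Name": "Event B", "Event_Date": "2021-04-01", "Number_of_Attendees": 40},
    {"Event_Name": "Event C", "Event_Date": "2021-04-03", "Number_of_Attendees": 75},
    {"Event_Name": "Event D", "Event_Date": "2021-03-31", "Number_of_Attendees": 12},
]


def sort_events(rows):
    """Sort by attendees (largest first), breaking ties by earliest date."""
    return sorted(rows, key=lambda row: (-row["Number_of_Attendees"], row["Event_Date"]))


for row in sort_events(sample_rows):
    print(row["Event_Name"], row["Number_of_Attendees"], row["Event_Date"])
```

On a DataFrame with such a column, the same ordering should be available through `sort_values(["Number_of_Attendees", "Event_Date"], ascending=[False, True])`.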

## Summary 

Congratulations! In this lab, you successfully developed a pipeline to scrape a website for concert event information!