## Web Scraping the Ryman Calendar

This notebook uses BeautifulSoup to create a DataFrame of upcoming events at the Ryman. This information is available at https://ryman.com/events/, which splits the events across multiple pages.

### Import, Verify and Format the HTML Code

In [None]:
import pandas as pd
import requests
from bs4 import BeautifulSoup as BS

In [None]:
URL = 'https://ryman.com/events/list/?tribe_event_display=list&tribe_paged=1'
ryman_html = requests.get(URL)

**Verify that the request succeeded and that the HTML was read in correctly.**

In [None]:
print('Status Code:',ryman_html.status_code)
print('Status Type:',type(ryman_html))
print('Code:')
print(ryman_html.text)
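
A raw status code is easier to interpret with its standard reason phrase next to it. The following is a minimal sketch using only the standard library; `describe_status` and `is_success` are hypothetical helper names, not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a label such as '200 OK' for a numeric HTTP status code."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # Codes outside the standard registry have no reason phrase
        return f"{code} Unknown"

def is_success(code):
    """Return True for any 2xx status code."""
    return 200 <= code < 300

print(describe_status(200))
print(describe_status(404))
print(is_success(404))
```

In the notebook, these helpers could be called with `ryman_html.status_code`.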

**Parse the HTML with BeautifulSoup and pretty-print it.**

In [None]:
ryman_events_html = BS(ryman_html.text, 'html.parser')
print(ryman_events_html.prettify())
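
Naming the parser explicitly keeps results consistent across machines, since BeautifulSoup otherwise picks whichever parser happens to be installed. Here is a small offline sketch; `sample_html` is a made-up snippet, not the Ryman page.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used in place of a live request
sample_html = "<html><body><h1>Events</h1><p class='note'>Two shows</p></body></html>"

# 'html.parser' ships with Python, so no extra install is needed
soup = BeautifulSoup(sample_html, "html.parser")

heading = soup.find("h1").text
note = soup.find("p", attrs={"class": "note"}).text

print(heading, "|", note)
print(soup.prettify())
```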

### Begin Web Scraping

**Using a webpage inspector, identify the HTML tags that hold each event's headliner, opener (if any), date and time, and ticket prices, then extract that information. Loop over all event pages; for each page, create and populate a separate list for each item, then add the results to a pandas DataFrame.**

**First add headliners, opening acts, dates and times.**
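
The page-by-page loop can be sketched without any network access. In this illustration, `FAKE_SITE`, `fetch_events`, and `collect_all_events` are hypothetical stand-ins for the real site and request logic; only the stopping rule (quit at the first empty page) mirrors the approach used in this notebook.

```python
# Hypothetical stand-in for the website: page number -> event titles
FAKE_SITE = {1: ["Show A", "Show B"], 2: ["Show C"]}

def fetch_events(page):
    """Return the events on a page, or an empty list past the last page."""
    return FAKE_SITE.get(page, [])

def collect_all_events(fetch):
    """Request pages 1, 2, 3, ... until a page comes back empty."""
    all_events = []
    page = 1
    while True:
        events = fetch(page)
        if not events:
            break
        all_events.extend(events)
        page += 1
    return all_events

all_events = collect_all_events(fetch_events)
print(all_events)
```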

In [None]:
# Declare the common event webpage url
URL = 'https://ryman.com/events/list/?tribe_event_display=list&tribe_paged='

# Initiate an empty DataFrame to populate with each page's event info
events_df = pd.DataFrame(columns=['Event', 'Opener', 'Date', 'Time'])

# Initiate variables to control iteration over the main loop
page_valid = True
page = 1

# Loop through each event page on the Ryman website
while page_valid:
    
    # Read in the url's html code
    ryman_html = requests.get(URL + str(page))
    
    # Terminate if the url is not valid
    # (`|` binds tighter than `==`, so test membership instead)
    if ryman_html.status_code in (400, 404):
        page_valid = False
        break
        
    # Parse the html code
    ryman_events_html = BS(ryman_html.text, 'html.parser')
      
    # Terminate if the url/page has no events
    events = ryman_events_html.findAll('div', attrs={'id':'primary', 'class':'tribe-events-loop'})
    if len(events) == 0:
        page_valid = False
        break
    
    # Initiate empty lists to populate with the current page's event info
    event_list = []
    time_list = []
    date_list = []
    opener_list = []
    
    # Loop over each event's info
    # Event info is wrapped in: <div class='tribe-beside-image'>
    for info in ryman_events_html.findAll('div', attrs={'class':'tribe-beside-image'}):
        
        # Titles are wrapped in: <a class="tribe-event-url">
        event_list.append(info.find('a', attrs={'class':'tribe-event-url'}).get('title'))
        
        # Dates and times are wrapped in: <time>
        date_time_info = info.find('time').text.upper()
        dt_split = date_time_info.split(' AT ')
        time_list.append(dt_split[1])
        date_list.append(dt_split[0][dt_split[0].find(',') + 2:].title())

        # OPENERS are wrapped in <span class='opener'>
        # HOWEVER, some artists have no openers and some artists have an extra span tag
        span_list = info.findAll('span', attrs={'class':'opener'})
        if len(span_list) == 0:
            opener_list.append('None')
        elif len(span_list) > 1:
            opener_list.append(span_list[1].text)
        else:
            opener_list.append(span_list[0].text)
    
    # Populate the dataframe with the new list values
    events_df = pd.concat([events_df, pd.DataFrame({'Event':event_list, 'Opener':opener_list, 'Date': date_list, 'Time':time_list})]).reset_index(drop=True)
    page += 1
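
The date and time handling can be isolated into a small function and tried on a sample string. This is a sketch: `split_date_time` is a hypothetical helper, and the input format ('Weekday, Month Day at Time') is assumed from the parsing logic rather than confirmed against the live page. It uses `str.partition`, so a string with no time portion yields an empty time instead of raising an `IndexError`.

```python
def split_date_time(raw):
    """Split text like 'Friday, March 6 at 7:30 PM' into (date, time)."""
    date_part, _, time_part = raw.upper().partition(' AT ')
    # Drop the weekday: keep what follows the first ', ' if there is one
    _, comma, rest = date_part.partition(', ')
    date_only = rest if comma else date_part
    return date_only.title(), time_part

print(split_date_time('Friday, March 6 at 7:30 PM'))
```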

##### Have a look at the events!

In [None]:
events_df

**Now add ticket prices (available on a separate webpage through the "MORE INFO" link).**
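
Each event ends up with one of three outcomes: a status such as sold out or canceled, a missing price, or a price string. That decision can be written as a pure function and checked on its own. `price_label` is a hypothetical helper for illustration, and the sample price strings are made up.

```python
def price_label(status_text, price_text):
    """Choose the value stored in the 'Ticket Prices' column.

    status_text: text of the status label, or None if the page has none
    price_text:  text of the price paragraph, or None if it is missing
    """
    if status_text in ('sold out', 'canceled'):
        return status_text.upper()
    if price_text is None:
        return 'Ticket Prices Not Available'
    return price_text

print(price_label('sold out', None))
print(price_label(None, None))
print(price_label(None, '$40 - $80'))
```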

In [None]:
# Declare the common event webpage url
URL = 'https://ryman.com/events/list/?tribe_event_display=list&tribe_paged='

# Initiate an empty DataFrame to populate with each page's event info
events_df = pd.DataFrame(columns=['Event', 'Opener', 'Date', 'Time', 'Ticket Prices'])

# Initiate variables to control iteration over the main loop
page_valid = True
page = 1


# Loop through each event page on the Ryman website
while page_valid:
    # Read in the url's html code
    ryman_html = requests.get(URL + str(page))
    
    # Terminate if the url is not valid
    # (`|` binds tighter than `==`, so test membership instead)
    if ryman_html.status_code in (400, 404):
        page_valid = False
        break
        
    # Parse the html code
    ryman_events_html = BS(ryman_html.text, 'html.parser')
      
    # Terminate if the url/page has no events
    events = ryman_events_html.findAll('div', attrs={'id':'primary', 'class':'tribe-events-loop'})
    if len(events) == 0:
        page_valid = False
        break
    
    # Initiate empty lists to populate with the current page's event info
    event_list = []
    opener_list = []
    date_list = []
    time_list = []
    price_list = []
    
    # Loop over each event
    # Event info is wrapped in: <div class='tribe-beside-image'>
    for info in ryman_events_html.findAll('div', attrs={'class':'tribe-beside-image'}):
        
        
        # TITLES are wrapped in: <a class="tribe-event-url">
        event_list.append(info.find('a', attrs={'class':'tribe-event-url'}).get('title'))
        
        
        # DATES and TIMES are wrapped in: <time>
        date_time_info = info.find('time').text.upper()
        dt_split = date_time_info.split(' AT ')
        time_list.append(dt_split[1])
        date_list.append(dt_split[0][dt_split[0].find(',') + 2:].title())

        
        # OPENERS are wrapped in <span class='opener'>
        # However, some artists have no openers and some artists have an extra span tag
        span_list = info.findAll('span', attrs={'class':'opener'})
        if len(span_list) == 0:
            opener_list.append('None')
        elif len(span_list) > 1:
            opener_list.append(span_list[-1].text)
        else:
            opener_list.append(span_list[0].text)
               
                
        # TICKET information is in a separate url available in a link wrapped in: <a class='smallblackbutton'>
        # However, some events may be cancelled or sold out
        ticket_url = info.find('a', attrs={'class':'smallblackbutton'}).get('href')
        ticket_html = requests.get(ticket_url)

        # Check if the ticket url is valid
        if ticket_html.status_code in (400, 404):
            price_list.append('Ticket Prices Not Available')
        else:
            # Parse ticket url html code
            ticket_info_html = BS(ticket_html.text, 'html.parser')

            # Ticket info is wrapped in: <div class='ticketdetails'>
            ticket_info = ticket_info_html.find('div', attrs={'class':'ticketdetails'})
            
            # Check ticket status (sold out, canceled, missing or available) and update the list accordingly
            # Guard against pages that lack the ticket details block entirely
            ticket_status = ticket_info.find('strong', attrs={'class':'show-status-label'}) if ticket_info else None
            price = ticket_info.find('p', attrs={'class':'theprices'}) if ticket_info else None
            if ticket_status is not None and ticket_status.text in ('sold out', 'canceled'):
                price_list.append(ticket_status.text.upper())
            elif price is None:
                # Price is missing
                price_list.append('Ticket Prices Not Available')
            else:
                price_list.append(price.text)
                           
    events_df = pd.concat([events_df, pd.DataFrame({'Event':event_list, 'Opener':opener_list, 'Date': date_list, 'Time':time_list, 'Ticket Prices':price_list})]).reset_index(drop=True)
    page += 1
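
A note on status-code checks: in Python, `|` is the bitwise OR operator and binds more tightly than `==`. An expression of the form `code == 404 | code == 400` is therefore read as the chained comparison `code == (404 | code) == 400`, not as two separate tests. The short demonstration below compares it with `or` and with a membership test.

```python
code = 404

# Parsed as a chained comparison: code == (404 | code) == 400
buggy = code == 404 | code == 400

# Alternatives that test each value as intended
with_or = code == 404 or code == 400
with_in = code in (400, 404)

print(buggy, with_or, with_in)
```

With `code = 404`, the first expression evaluates to `False` while the other two evaluate to `True`, so the bitwise form would never stop the loop on a 404 response.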

##### Have a look at the events!

In [None]:
print(events_df)