# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to practice those skills on a full-fledged site!
In this lab, you'll scrape event listings from a music website: https://www.residentadvisor.net.

## Objectives

You will be able to:
* Create a full scraping pipeline that traverses many pages of a website, handles errors, and stores the data

## View the Website

For this lab, you'll be scraping the https://www.residentadvisor.net website. Start by navigating to the events page [here](https://www.residentadvisor.net/events) in your browser.

<img src="images/ra.png">

In [14]:
# Load the https://www.residentadvisor.net/events page in your browser.
from bs4 import BeautifulSoup
import pandas as pd  # used later to build the events DataFrame
import requests

# A 2019 listing is requested because few events are listed nowadays due to Covid-19
html_page = requests.get('https://www.residentadvisor.net/events/us/georgia/month/2019-08-23')
soup = BeautifulSoup(html_page.content, 'html.parser')


## Open the Inspect Element Feature

Next, open the inspect element feature from your web browser in order to preview the underlying HTML associated with the page.

In [44]:
# Open the inspect element feature in your browser
events = soup.findAll('h1', {'class': 'event-title'})
url = events[0].find('a').attrs['href']


main = 'https://www.residentadvisor.net'
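
The `href` values scraped from the listing are relative paths, so they have to be combined with the site root before they can be requested. Plain string concatenation works here, but `urllib.parse.urljoin` from the standard library is a slightly more forgiving option because it also copes with hrefs that are already absolute. The following is a minimal sketch; `BASE_URL` and `absolute_url` are hypothetical names, and the sample path is the event URL used later in this lab.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.residentadvisor.net'  # hypothetical constant name


def absolute_url(href, base=BASE_URL):
    """Turn an href scraped from a page into an absolute URL."""
    return urljoin(base, href)


# A relative path is resolved against the site root
print(absolute_url('/events/1303824'))
# An href that is already absolute is returned unchanged
print(absolute_url('https://www.residentadvisor.net/events'))
```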

In [42]:
# Request the first event's detail page and parse it
event_page = requests.get(main+url)
event_soup = BeautifulSoup(event_page.content, 'html.parser')

Event_Name = event_soup.find('div', {'id': 'breadcrumb', 'class': 'clearfix'}).nextSibling.nextSibling.text
Number_of_attendees = int(event_soup.find('h1',{'id':'MembersFavouriteCount'}).text.strip())
Venue = event_soup.findAll('a', {'class': 'cat-rev'})[1].text + ', ' + event_soup.findAll('a', {'class': 'cat-rev'})[1].nextSibling.nextSibling
Event_Date = event_soup.findAll('a', {'class': 'cat-rev'})[0].text

print(Number_of_attendees, Venue, Event_Date, Event_Name)


9 Ravine,  1021 Peachtree St NE, Atlanta, GA 30309 23 Aug 2019 Bonobo (DJ Set) & Matthew Dear (DJ Set)


In [41]:
event_soup.find('div', {'id': 'breadcrumb', 'class': 'clearfix'}).nextSibling.nextSibling.text

'Bonobo (DJ Set) & Matthew Dear (DJ Set)'

In [10]:
events[0].find('a').attrs['title']

'Event details of Bonobo (DJ Set) & Matthew Dear (DJ Set)'

## Write a Function to Scrape All of the Events on the Given Events Page

The function should return a Pandas DataFrame with columns for the Event_Name, Venue, Event_Date and Number_of_Attendees.

In [154]:

# Explore a single event page before writing the function
main = 'https://www.residentadvisor.net/events/1303824'
event_page = requests.get(main)
event_soup = BeautifulSoup(event_page.content, 'html.parser')

name = event_soup.find('div', {'id': 'breadcrumb', 'class': 'clearfix'}).nextSibling.nextSibling.text
attendees = int(event_soup.find('h1',{'id':'MembersFavouriteCount'}).text.strip())


In [155]:
event_soup.find('ul', {'class': 'clearfix'}).findAll('div')[0].text.split()[0]
items = event_soup.find('ul', {'class': 'clearfix'})

In [174]:
items.findAll('li', {'class':'wide'})[0].text


'Date /Sat, 14 Sep 2019  - Sun, 15 Sep 201918:00 - 18:00'
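
The text of this list item runs the label, both dates, and the times together, so it cannot be used as a date directly. One possible way to pull the dates out is a regular expression followed by `datetime.strptime`. This is only a sketch: `DATE_PATTERN` and `extract_dates` are hypothetical names, and it assumes dates always appear as day, abbreviated English month, and four-digit year (and that the locale is English, since `%b` is locale dependent).

```python
import re
from datetime import datetime

# Matches dates such as "14 Sep 2019" (assumed format)
DATE_PATTERN = re.compile(r'\d{1,2} [A-Z][a-z]{2} \d{4}')


def extract_dates(text):
    """Return every date found in text as a datetime.date object."""
    return [datetime.strptime(match, '%d %b %Y').date()
            for match in DATE_PATTERN.findall(text)]


# The string returned by the cell above
raw = 'Date /Sat, 14 Sep 2019  - Sun, 15 Sep 201918:00 - 18:00'
start_date, end_date = extract_dates(raw)
print(start_date.isoformat(), end_date.isoformat())
```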

In [197]:
items = event_soup.find('ul', {'class': 'clearfix'})
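# Bug: findAll() returns a ResultSet (a list of tags), which has no .find() method,
# so the next line raises the AttributeError shown below.
items.findAll('li').find('/div')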

# items[0].nextSibling.nextSibling.nextSibling
# items[1].nextSibling
# list_items = [item.nextsibling for item in items[0:2]]
# list_items

AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
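
The error message points at the cause: `findAll` (or `find_all`) returns a list-like `ResultSet`, and `find` has to be called on each tag inside it rather than on the list. The sketch below shows the pattern on a small, hypothetical HTML fragment that only loosely imitates the event page; the real markup is more complex.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup used only for illustration
html = """
<ul class="clearfix">
  <li class="wide"><div>Date /</div><a href="#">Sat, 14 Sep 2019</a></li>
  <li class="wide"><div>Venue /</div><a class="cat-rev" href="#">Ravine</a></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')
items = soup.find('ul', {'class': 'clearfix'})

# Iterate over the ResultSet and call find() on each <li> tag individually
labels = [li.find('div').text for li in items.find_all('li')]
print(labels)
```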

In [246]:
def scrape_events(events_page_url):
    # Fetch and parse the events listing page
    main = 'https://www.residentadvisor.net'
    html_page = requests.get(events_page_url)
    soup = BeautifulSoup(html_page.content, 'html.parser')
    events = soup.findAll('h1', {'class': 'event-title'})  # one <h1> per event on the page
    
    # initialize data
    names = []
    attendees = []
    venues = []
    addresses = []
    dates = []
    
    for event in events:
        url = event.find('a').attrs['href']
        event_page = requests.get(main+url)
        event_soup = BeautifulSoup(event_page.content, 'html.parser')
        
        try:
            name = event_soup.find('div', {'id': 'breadcrumb', 'class': 'clearfix'}).nextSibling.nextSibling.text
            attendee = int(event_soup.find('h1',{'id':'MembersFavouriteCount'}).text.strip())
            items = event_soup.find('ul', {'class': 'clearfix'}).findAll('div')
            date = event_soup.find('ul', {'class': 'clearfix'}).find('a').text
            venue = items[1].nextSibling
            address = items[1].nextSibling.nextSibling.nextSibling
#             venue = event_soup.findAll('a', {'class': 'cat-rev'})[1].text + ', ' + event_soup.findAll('a', {'class': 'cat-rev'})[1].nextSibling.nextSibling
#             date = event_soup.findAll('a', {'class': 'cat-rev'})[0].text
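        except Exception as e:
            # name, date, venue, or address may be unassigned (or left over from the
            # previous event) if parsing failed early, so report the URL instead and
            # skip this event rather than abandoning the rest of the page
            print('Skipping', main + url, e)
            continue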
        names.append(name)
        attendees.append(attendee)
        venues.append(venue)
        addresses.append(address)
        dates.append(date)
        
    df = pd.DataFrame([names, venues, addresses, dates, attendees]).transpose()

    df.columns = ["Event_Name", "Venue", "Addresses", "Event_Date", "Number_of_Attendees"]
    return df


In [199]:
# Checking the scrape_events function
event_page = 'https://www.residentadvisor.net/events/us/georgia/month/2019-08-23'
scrape_events(event_page)

| | Event_Name | Venue | Addresses | Event_Date | Number_of_Attendees |
|---|---|---|---|---|---|
| 0 | Bonobo (DJ Set) & Matthew Dear (DJ Set) | [Ravine] | 1021 Peachtree St NE, Atlanta, GA 30309 | 23 Aug 2019 | 9 |
| 1 | The Field, Anticipation, Soft Talk | [Drunken Unicorn] | 736 Ponce De Leon, Place Northeast; Atlanta, ... | 23 Aug 2019 | 2 |
| 2 | Official [Rooftop] Pre-Party: Zemya Fest 2019 | [The Rooftop] | 421 Edgewood Ave. Atlanta GA 30312 USA | 24 Aug 2019 | 48 |
| 3 | Felix Jaehn | [Ravine] | 1021 Peachtree St NE, Atlanta, GA 30309 | 24 Aug 2019 | 0 |
| 4 | A Night with Ron Trent | [Crazy Atlanta] | 182 Courtland St NE Atlanta, GA 30303 United ... | 30 Aug 2019 | 8 |
| 5 | Distinctive Welcomes Brett Dancer | [The Sound Table] | 483 Edgewood Avenue SE; Atlanta, GA 30312; Un... | 30 Aug 2019 | 2 |
| 6 | Laura Indorf and Corey Jackson | [Midcity Cafe] | 850 W Peachtree St NW, Atlanta, Georgia 30308 | 30 Aug 2019 | 0 |
| 7 | People of Earth 4 Year Anniversary feat. Alton... | Pal's Lounge | 254 Auburn Ave NE, Atlanta, GA 30303 | 31 Aug 2019 | 18 |
| 8 | The Summit 2019 | [Murmur Gallery] | 100 Broad St SW | 31 Aug 2019 | 7 |
| 9 | Crazy Con: Dragon Con Afterparty | [Crazy Atlanta] | 182 Courtland St NE Atlanta, GA 30303 United ... | 31 Aug 2019 | 4 |


## Write a Function to Retrieve the URL for the Next Page

In [229]:
main = 'https://www.residentadvisor.net/events/us/georgia/month/2019-09-23'
main_page = requests.get(main)
main_soup = BeautifulSoup(main_page.content, 'html.parser')
next_button = main_soup.find('li', {'id': 'liNext2'})


In [230]:
next_button.find('a').attrs['href']

'/events/us/georgia/month/2019-10-23'

In [235]:
def next_page(url):
    main = 'https://www.residentadvisor.net'
    current = requests.get(url)
    current_soup = BeautifulSoup(current.content, 'html.parser')
    next_button = current_soup.find('li', {'id': 'liNext2'})
    return main + next_button.find('a').attrs['href']

In [245]:
#testing
next_page('https://www.residentadvisor.net/events/us/georgia/month/2019-10-23')

'https://www.residentadvisor.net/events/us/georgia/month/2019-11-23'

## Scrape the Next 1000 Events for Your Area

Display the data sorted by the number of attendees. If there is a tie for the number attending, sort by event date.

In [239]:
import time

In [253]:
# Scrape listing pages until at least 1000 events have been collected
dfs = []
nrows = 0
cur_url = 'https://www.residentadvisor.net/events/us/georgia/month/2017-08-23'
while nrows < 1000:
    print('Scraping: ', cur_url)
    df = scrape_events(cur_url)
    nrows += len(df)
    dfs.append(df)
    cur_url = next_page(cur_url)
    time.sleep(0.2)
df = pd.concat(dfs)
df.head()

Scraping:  https://www.residentadvisor.net/events/us/georgia/month/2017-08-23
Scraping:  https://www.residentadvisor.net/events/us/georgia/month/2017-09-23
Scraping:  https://www.residentadvisor.net/events/us/georgia/month/2017-10-23
Scraping:  https://www.residentadvisor.net/events/us/georgia/month/2017-11-23
Scraping:  https://www.residentadvisor.net/events/us/georgia/month/2017-12-23
Scraping:  https://www.residentadvisor.net/events/us/georgia/month/2018-01-23
Scraping:  https://www.residentadvisor.net/events/us/georgia/month/2018-02-23
Scraping:  https://www.residentadvisor.net/events/us/georgia/month/2018-03-23
Scraping:  https://www.residentadvisor.net/events/us/georgia/month/2018-04-23
Scraping:  https://www.residentadvisor.net/events/us/georgia/month/2018-05-23
Scraping:  https://www.residentadvisor.net/events/us/georgia/month/2018-06-23
Scraping:  https://www.residentadvisor.net/events/us/georgia/month/2018-07-23
Scraping:  https://www.residentadvisor.net/events/us/georgia/mon

KeyError: 'href'
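
The `KeyError: 'href'` means the scraper reached a page whose "next" element has an `<a>` tag without an `href` attribute, which is what one would expect on the last available listing page. A more defensive version of the lookup could return `None` in that case so the loop can stop cleanly. The following is a sketch under that assumption: `extract_next_url` is a hypothetical helper that works on already-downloaded HTML, and the two HTML strings are simplified stand-ins for the real markup.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'https://www.residentadvisor.net'


def extract_next_url(html, base=BASE_URL):
    """Return the absolute URL of the next listing page, or None if there is none."""
    soup = BeautifulSoup(html, 'html.parser')
    next_button = soup.find('li', {'id': 'liNext2'})
    if next_button is None:
        return None
    link = next_button.find('a')
    # Tag.get() returns None instead of raising KeyError when href is missing
    href = link.get('href') if link is not None else None
    return urljoin(base, href) if href else None


# Hypothetical markup: one page with a usable next link, one without
with_link = '<li id="liNext2"><a href="/events/us/georgia/month/2019-10-23">Next</a></li>'
without_link = '<li id="liNext2"><a>Next</a></li>'
print(extract_next_url(with_link))
print(extract_next_url(without_link))
```

With a helper like this, the scraping loop could use a condition such as `while nrows < 1000 and cur_url is not None` so that it ends when the listings run out instead of raising an exception.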

In [254]:
df = pd.concat(dfs)
df.head()

| | Event_Name | Venue | Addresses | Event_Date | Number_of_Attendees |
|---|---|---|---|---|---|
| 0 | Tocayo (Open to Close) | [The Alley Cat Music Club] | 182 Courtland St NE Atlanta,GA 30312 | 25 Aug 2017 | 3 |
| 1 | Red Bull Culture Clash Atlanta 2017 | [787 Windsor] | 787 Windsor St SW, Atlanta, GA 30315, USA | 25 Aug 2017 | 1 |
| 2 | Official Pre - Party: Zemya Fest 2017 [Open Air] | TBA - Atlanta | To Be Announced | 26 Aug 2017 | 77 |
| 3 | Shabazz Palaces, Porter Ray, The Morkestra | [Terminal West] | 887 West Marietta St NW, Atlanta, GA 30318, U... | 26 Aug 2017 | 1 |
| 4 | The Atlanta Weekender 2017 | [The Sound Table] | 483 Edgewood Avenue SE; Atlanta, GA 30312; Un... | 31 Aug 2017 | 6 |


In [255]:
df.sort_values(by='Number_of_Attendees', ascending=False)

| | Event_Name | Venue | Addresses | Event_Date | Number_of_Attendees |
|---|---|---|---|---|---|
| 18 | Zemya Fest 2017 | TBA - Atlanta | To Be Announced | 9 Sep 2017 | 132 |
| 22 | Sunset: Friends & Family [Invite Only] | TBA - Atlanta | To Be Announced | 10 Mar 2018 | 79 |
| 16 | Sunset: Friends & Family [Invite Only] | TBA - Atlanta | To Be Announced | 10 Mar 2018 | 79 |
| 2 | Official Pre - Party: Zemya Fest 2017 [Open Air] | TBA - Atlanta | To Be Announced | 26 Aug 2017 | 77 |
| 0 | Sunset: Friends & Family [Invite Only] | TBA - Atlanta | To Be Announced | 23 Mar 2019 | 77 |
| ... | ... | ... | ... | ... | ... |
| 3 | Space Jesus, Buku, Of the Trees, Huxley Anne | [Terminal West] | 887 West Marietta St NW, Atlanta, GA 30318, U... | 10 Jan 2018 | 0 |
| 1 | Getter | [Ravine] | 1021 Peachtree St NE, Atlanta, GA 30309 | 24 Jan 2020 | 0 |
| 0 | NYE Warehouse Afterparty by Alley Cat Music | [Okami] | 645 Shelton Ave SW Atlanta, Ga 30310 | 1 Jan 2018 | 0 |
| 11 | Wish List at The Social Holiday | [Lava Lounge] | 57 13th Street NE; Atlanta, GA 30361; United ... | 16 Dec 2017 | 0 |
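
The task also asks for ties in attendance to be ordered by event date, which the single-column sort above does not guarantee. Because `Event_Date` is stored as text such as `23 Aug 2019`, it needs to be converted to a real datetime before it can serve as a tie-breaker. The sketch below uses four rows copied from the output above; `df_sample` and `ranked` are hypothetical names, and it assumes every date follows the `%d %b %Y` pattern.

```python
import pandas as pd

# Four rows copied from the scraped output, used as a small sample
df_sample = pd.DataFrame({
    'Event_Name': [
        'Sunset: Friends & Family [Invite Only]',
        'Zemya Fest 2017',
        'Official Pre - Party: Zemya Fest 2017 [Open Air]',
        'Sunset: Friends & Family [Invite Only]',
    ],
    'Event_Date': ['23 Mar 2019', '9 Sep 2017', '26 Aug 2017', '10 Mar 2018'],
    'Number_of_Attendees': [77, 132, 77, 79],
})

# Convert the text columns to proper types so they sort correctly
df_sample['Event_Date'] = pd.to_datetime(df_sample['Event_Date'], format='%d %b %Y')
df_sample['Number_of_Attendees'] = df_sample['Number_of_Attendees'].astype(int)

# Most attended first; ties broken by the earlier event date
ranked = df_sample.sort_values(
    by=['Number_of_Attendees', 'Event_Date'],
    ascending=[False, True],
).reset_index(drop=True)
print(ranked)
```

If some scraped dates do not match the assumed pattern, passing `errors='coerce'` to `pd.to_datetime` turns them into `NaT` instead of raising an error.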


In [252]:
df['Event_Name'].value_counts().head(20)

Common Circuits Festival                                      3
Psycho Disco                                                  3
Imagine Music Festival                                        3
Laura Indorf and Corey Jackson                                3
Swank                                                         3
Space Jesus, Tsuruda, Tiedye Ky, Onhell                       2
Fisher -Outdoor Block Party presented by Catch and Release    2
Proper Taste: All Night Long                                  2
Project B. Sunset Rooftop Party                               2
Records with Friends                                          2
Mersiv, Fryar, Kozmic, Andy Bruh                              2
Stroke                                                        2
Zemya Fest 2019                                               2
Party Favor: Layers Envisioned                                1
Goopsteppa, Supertask, Ill Chill, Dollarsine                  1
Corona Electric Beach: Road To EDC Orlan
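
Several event names appear more than once. Some of these may be genuinely recurring events, but the sorted output also contains rows with the same name and the same date, possibly because consecutive listing pages overlap. One way to remove such exact repeats is `drop_duplicates` on the name and date columns. This is a sketch on three rows taken from the output above; `events_sample` and `deduped` are hypothetical names, and treating name plus date as the identity of an event is an assumption.

```python
import pandas as pd

# Three rows taken from the sorted output: two identical, one on a different date
events_sample = pd.DataFrame({
    'Event_Name': ['Sunset: Friends & Family [Invite Only]'] * 3,
    'Event_Date': ['10 Mar 2018', '10 Mar 2018', '23 Mar 2019'],
    'Number_of_Attendees': [79, 79, 77],
})

# Keep the first occurrence of each (name, date) pair and renumber the rows
deduped = events_sample.drop_duplicates(
    subset=['Event_Name', 'Event_Date']
).reset_index(drop=True)
print(deduped)
```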

In [256]:
df.to_csv('scraped.csv')  

## Summary 

Congratulations! In this lab, you successfully developed a pipeline to scrape a website for concert event information!