This Jupyter Notebook is a demonstration of web scraping in Python. We are tasked with building a dataset of the top Broadway shows along with their theatre and ticket price. We can compile the dataset by scraping [www.broadway.com](www.broadway.com) with the help of the Python libraries requests, BeautifulSoup, and json.

In [1]:
#importing libraries


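import requests # -> Sends the HTTP request that downloads a page
from bs4 import BeautifulSoup # -> Parses the HTML so we can search it by tag and class
import json # -> Turns the JSON text embedded in a page into Python dictionaries and lists
import pandas as pd # -> Holds the final dataset as a DataFrame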

In [2]:
url = "https://www.broadway.com"
response = requests.get(url)
html = response.content
scraped = BeautifulSoup(html, 'html.parser')

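Let's test the response below. We want to see a 200, which means the request succeeded. A code in the 400s means the request itself was rejected (400 is "Bad Request", 404 is "Not Found"), and a code in the 500s means the server failed.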

In [3]:
response

<Response [200]>
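
The same check can be written out in code. The sketch below uses only the standard library's `http.HTTPStatus` to label a status code by its range. `describe_status` is a hypothetical helper name chosen for this illustration, not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short human-readable label for an HTTP status code."""
    if 100 <= code < 200:
        category = "informational"
    elif 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirect"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "unknown"

    # HTTPStatus raises ValueError for codes it does not know about
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unrecognized"

    return f"{code} {phrase}: {category}"


print(describe_status(200))
print(describe_status(400))
```

In the notebook itself, the integer to pass in would be `response.status_code`. Calling `response.raise_for_status()` is a shorter alternative that raises an exception for any 4xx or 5xx response.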

In [4]:
#This is the full HTML of the homepage. Now we just have to do some digging.
scraped


<!DOCTYPE html>

<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<title>Broadway Tickets | Broadway Shows | Theater Tickets | Broadway.com</title>
<meta content="width=device-width, minimum-scale=0.1, initial-scale=1" name="viewport"/>
<meta content="The most comprehensive source for Broadway Shows, Broadway Tickets, Off-Broadway, London theater information, Tickets, Gift Certificates, Videos, News &amp; Features, Reviews, Photos." name="description"/>
<meta content="#25262b" name="msapplication-TileColor"/>
<meta content="#1f2023" name="theme-color"/>
<!-- Site Verification -->
<meta content="10C3133083513657F42058B6E919699D" name="msvalidate.01"/>
<meta content="b4c96cd61f1e98c1" name="y_key"/>
<!-- Mixpanel -->
<script type="text/javascript">
                (function(c,a){if(!a.__SV){var b=window;try{var d,m,j,k=b.location,f=k.hash;d=function(a,b){return(m=a

After doing some digging in the above text, we found the part of the page that holds the information we want to extract. There is a box on the homepage that lists the popular shows. We will use it to get the URL of each show's page, where the remaining details live.

The list is embedded in the page as JSON, so we will use the json library to turn the text we see below into a Python dictionary that is easy to navigate.

In [5]:
#This is how you get to the popular show list. 
scraped.find('div', class_='popular-shows__list-items')

<div class="popular-shows__list-items">
<script type="application/ld+json">
        {"@context": "http://schema.org", "@type": "ItemList", "itemListElement": [{"@type": "ListItem", "position": 1, "name": "Moulin Rouge! The Musical", "url": "https://broadway.com/shows/moulin-rouge-musical/"}, {"@type": "ListItem", "position": 2, "name": "Wicked", "url": "https://broadway.com/shows/wicked/"}, {"@type": "ListItem", "position": 3, "name": "The Lion King", "url": "https://broadway.com/shows/the-lion-king/"}, {"@type": "ListItem", "position": 4, "name": "To Kill a Mockingbird", "url": "https://broadway.com/shows/to-kill-mockingbird/"}, {"@type": "ListItem", "position": 5, "name": "Hamilton", "url": "https://broadway.com/shows/hamilton-broadway/"}, {"@type": "ListItem", "position": 6, "name": "Hadestown", "url": "https://broadway.com/shows/hadestown/"}, {"@type": "ListItem", "position": 7, "name": "Chicago", "url": "https://broadway.com/shows/chicago/"}, {"@type": "ListItem", "position": 8, "

In [6]:
shows = scraped.find('div', class_='popular-shows__list-items')

As stated above, json is the perfect library to handle this situation. As we can see below, once the script text is loaded, all we need to do is work through a dictionary of the shows' information to get each show's name and URL.

In [7]:
json.loads("".join(shows.find("script", {"type":"application/ld+json"}).contents))

{'@context': 'http://schema.org',
 '@type': 'ItemList',
 'itemListElement': [{'@type': 'ListItem',
   'position': 1,
   'name': 'Moulin Rouge! The Musical',
   'url': 'https://broadway.com/shows/moulin-rouge-musical/'},
  {'@type': 'ListItem',
   'position': 2,
   'name': 'Wicked',
   'url': 'https://broadway.com/shows/wicked/'},
  {'@type': 'ListItem',
   'position': 3,
   'name': 'The Lion King',
   'url': 'https://broadway.com/shows/the-lion-king/'},
  {'@type': 'ListItem',
   'position': 4,
   'name': 'To Kill a Mockingbird',
   'url': 'https://broadway.com/shows/to-kill-mockingbird/'},
  {'@type': 'ListItem',
   'position': 5,
   'name': 'Hamilton',
   'url': 'https://broadway.com/shows/hamilton-broadway/'},
  {'@type': 'ListItem',
   'position': 6,
   'name': 'Hadestown',
   'url': 'https://broadway.com/shows/hadestown/'},
  {'@type': 'ListItem',
   'position': 7,
   'name': 'Chicago',
   'url': 'https://broadway.com/shows/chicago/'},
  {'@type': 'ListItem',
   'position': 8,
   

In [8]:
#The top line is the same code as above, just renamed as a variable.
all_shows = json.loads("".join(shows.find("script", {"type":"application/ld+json"}).contents))

top_shows = all_shows['itemListElement']
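
The lookup above assumes both the `div` and the `script` tag exist; if the site layout changes, `find` returns `None` and the chained call fails. As a minimal sketch of a more defensive version, the hypothetical helper `extract_json_ld` below returns `None` instead of raising. It runs against a small hand-written HTML snippet (an assumption that mirrors the structure shown earlier) so that it works without a network connection.

```python
import json

from bs4 import BeautifulSoup


def extract_json_ld(soup, tag_name, css_class):
    """Return the parsed JSON-LD inside the first matching container, or None."""
    container = soup.find(tag_name, class_=css_class)
    if container is None:
        return None
    script = container.find("script", {"type": "application/ld+json"})
    # .string is the text inside the <script> tag (None if the tag is empty)
    if script is None or not script.string:
        return None
    return json.loads(script.string)


# Hand-written sample that mimics the homepage structure (not live data)
sample_html = """
<div class="popular-shows__list-items">
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "ItemList", "itemListElement": [
  {"@type": "ListItem", "position": 1, "name": "Moulin Rouge! The Musical",
   "url": "https://broadway.com/shows/moulin-rouge-musical/"},
  {"@type": "ListItem", "position": 2, "name": "Wicked",
   "url": "https://broadway.com/shows/wicked/"}
]}
</script>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
parsed = extract_json_ld(sample_soup, "div", "popular-shows__list-items")
print([item["name"] for item in parsed["itemListElement"]])
```

Returning `None` lets the caller decide whether a missing block should stop the run or simply be skipped.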

In [9]:
top_shows

[{'@type': 'ListItem',
  'position': 1,
  'name': 'Moulin Rouge! The Musical',
  'url': 'https://broadway.com/shows/moulin-rouge-musical/'},
 {'@type': 'ListItem',
  'position': 2,
  'name': 'Wicked',
  'url': 'https://broadway.com/shows/wicked/'},
 {'@type': 'ListItem',
  'position': 3,
  'name': 'The Lion King',
  'url': 'https://broadway.com/shows/the-lion-king/'},
 {'@type': 'ListItem',
  'position': 4,
  'name': 'To Kill a Mockingbird',
  'url': 'https://broadway.com/shows/to-kill-mockingbird/'},
 {'@type': 'ListItem',
  'position': 5,
  'name': 'Hamilton',
  'url': 'https://broadway.com/shows/hamilton-broadway/'},
 {'@type': 'ListItem',
  'position': 6,
  'name': 'Hadestown',
  'url': 'https://broadway.com/shows/hadestown/'},
 {'@type': 'ListItem',
  'position': 7,
  'name': 'Chicago',
  'url': 'https://broadway.com/shows/chicago/'},
 {'@type': 'ListItem',
  'position': 8,
  'name': 'Waitress',
  'url': 'https://broadway.com/shows/waitress/'},
 {'@type': 'ListItem',
  'position':

In [10]:
#Now we will parse through the nested dictionary above to get the information we're looking for.
show_dict = {}
for show in top_shows:
    show_dict[show['name']] = show['url']

In [11]:
#This is a list of urls that we will visit
urls = list(show_dict.values())
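
The two cells above can also be condensed with a dictionary comprehension. The sketch below uses `sample_shows`, a hand-copied subset of the list shown earlier, so the names `sample_shows`, `sample_dict`, and `sample_urls` are illustrative only.

```python
# A hand-copied subset of the scraped list, used only for illustration
sample_shows = [
    {"@type": "ListItem", "position": 1, "name": "Moulin Rouge! The Musical",
     "url": "https://broadway.com/shows/moulin-rouge-musical/"},
    {"@type": "ListItem", "position": 2, "name": "Wicked",
     "url": "https://broadway.com/shows/wicked/"},
    {"@type": "ListItem", "position": 3, "name": "The Lion King",
     "url": "https://broadway.com/shows/the-lion-king/"},
]

# Map each show name to its URL in a single expression
sample_dict = {show["name"]: show["url"] for show in sample_shows}

# Dictionaries keep insertion order (Python 3.7+), so the URLs stay in list order
sample_urls = list(sample_dict.values())

print(sample_urls)
```

One thing to keep in mind with this design is that a dictionary holds one entry per key, so two list items with the same name would collapse into a single URL.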

The code below shows, for the first URL, what we will do for ALL of the URLs. By visiting each page, we can extract its information the same way we did above.

In [12]:
response = requests.get(urls[0])
html = response.content
scraped = BeautifulSoup(html, 'html.parser')
    
scraped.find('div', class_='rspCalendar__cellGrid')

<div class="rspCalendar__cellGrid">
<script type="application/ld+json">{"@context": "http://schema.org", "@type": "TheaterEvent", "name": "Moulin Rouge! The Musical - Broadway Tickets", "description": "A theatrical celebration of truth, beauty, freedom and love.", "image": "https://imaging.broadway.com/images/widescreen-169/w1920/120034-15.jpg", "location": {"@type": "Place", "name": "Al Hirschfeld Theatre", "address": {"@type": "PostalAddress", "streetAddress": "302 West 45th Street", "AddressCountry": "US", "addressLocality": "New York", "addressRegion": "NY", "postalCode": "10036"}, "url": "https://www.broadway.com/venues/theaters/al-hirschfeld-theatre/"}, "offers": {"@type": "Offer", "url": "https://www.broadway.com/shows/moulin-rouge-musical/", "price": "69.00", "priceCurrency": "USD", "availability": "http://schema.org/InStock", "validFrom": "2019-07-25"}, "startDate": "2021-10-15T19:00:00-04:00", "endDate": "2021-10-15", "performers": {"@type": "TheaterGroup", "name": "Moulin Ro

The code below visits the page of each of the top 25 Broadway shows, parses the information, and grabs the show name, description, theatre name, and the price of a ticket.

In [13]:
show_list = []
description = []
theatre_name = []
price_list = []

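for url in urls:
    # Download and parse the page for one show
    response = requests.get(url)
    html = response.content
    scraped = BeautifulSoup(html, 'html.parser')

    # The show details are embedded as JSON-LD inside the calendar grid
    info = scraped.find('div', class_='rspCalendar__cellGrid')
    info = json.loads("".join(info.find("script", {"type":"application/ld+json"}).contents))

    # The name looks like "<show> - Broadway Tickets"; keep the part before the last " - "
    show_list.append(info['name'].rsplit(' - ', 1)[0])
    description.append(info['description'])
    theatre_name.append(info['location']['name'])
    # The price arrives as a string such as "69.00"; store it as a number
    price_list.append(float(info['offers']['price']))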
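
The field extraction inside the loop can be pulled out into a function and tried on a single dictionary, with no network access needed. The sketch below is one possible shape, not the notebook's own code. `parse_show_info` is a hypothetical name, and `sample_info` is a trimmed copy of the JSON-LD shown earlier for the first show.

```python
def parse_show_info(info: dict) -> dict:
    """Flatten one TheaterEvent JSON-LD dict into the fields used in this notebook."""
    # The name looks like "<show> - Broadway Tickets"; cut at the last " - "
    title = info.get("name", "")
    show = title.rsplit(" - ", 1)[0].strip()

    # Fall back to empty dicts so a missing section does not raise KeyError
    location = info.get("location") or {}
    offers = info.get("offers") or {}
    price = offers.get("price")

    return {
        "show": show,
        "description": info.get("description"),
        "theatre": location.get("name"),
        "price": float(price) if price is not None else None,
    }


# Trimmed copy of the JSON-LD for the first show, used only for illustration
sample_info = {
    "@type": "TheaterEvent",
    "name": "Moulin Rouge! The Musical - Broadway Tickets",
    "description": "A theatrical celebration of truth, beauty, freedom and love.",
    "location": {"@type": "Place", "name": "Al Hirschfeld Theatre"},
    "offers": {"@type": "Offer", "price": "69.00", "priceCurrency": "USD"},
}

row = parse_show_info(sample_info)
print(row)
```

Using `.get` with fallbacks means a page that lacks a price or venue yields `None` for that field rather than stopping the whole loop.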
    
    

Now that we have all of the information we are looking for in list form, we can easily put this into a DataFrame.

In [14]:
broadway = pd.DataFrame({'Shows': show_list, 'Description': description, 'Theater': theatre_name, 'Current Ticket Cost': price_list})

In [15]:
broadway

,Shows,Description,Theater,Current Ticket Cost
0,Moulin Rouge! The Musical,"A theatrical celebration of truth, beauty, fre...",Al Hirschfeld Theatre,69.0
1,Wicked,Meet the witches of Oz before Dorothy dropped in.,Gershwin Theatre,89.0
2,The Lion King,Pride Rock comes to life in Disney’s long-runn...,Minskoff Theatre,75.0
3,To Kill a Mockingbird,Harper Lee’s classic courtroom drama comes to ...,Shubert Theatre,29.0
4,Hamilton,A fresh look at the era of the Founding Fathers.,Richard Rodgers Theatre,149.0
5,Hadestown,The Tony-winning musical that follows the myth...,Walter Kerr Theatre,49.0
6,Chicago,The Tony-winning revival of Kander and Ebb’s m...,Ambassador Theatre,49.5
7,Waitress,Sara Bareilles’ score and creatively titled pi...,Ethel Barrymore Theatre,79.0
8,Tina: The Tina Turner Musical,The story of the legendary Tina Turner is now ...,Lunt-Fontanne Theatre,79.0
9,The Music Man,Hugh Jackman returns to Broadway as Harold Hil...,Winter Garden Theatre,99.0
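
The prices in the table above are numeric, while the JSON-LD `price` field is a string such as "69.00". As a small sketch of how that conversion could be done after the DataFrame is built, the example below uses three rows copied from the table, with the prices rewritten as strings (an assumption about their raw form), converts the column with `pd.to_numeric`, and sorts by cost.

```python
import pandas as pd

# Three rows copied from the results; prices written as strings for illustration
rows = [
    {"Shows": "Moulin Rouge! The Musical", "Theater": "Al Hirschfeld Theatre",
     "Current Ticket Cost": "69.00"},
    {"Shows": "Wicked", "Theater": "Gershwin Theatre",
     "Current Ticket Cost": "89.00"},
    {"Shows": "To Kill a Mockingbird", "Theater": "Shubert Theatre",
     "Current Ticket Cost": "29.00"},
]

sample = pd.DataFrame(rows)

# Convert the price column from strings to floats so it sorts numerically
sample["Current Ticket Cost"] = pd.to_numeric(sample["Current Ticket Cost"])

# Cheapest show first, with a fresh 0..n-1 index
cheapest_first = sample.sort_values("Current Ticket Cost").reset_index(drop=True)
print(cheapest_first)
```

Without the conversion, sorting would compare the prices as text, which puts "149.00" ahead of "29.00".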
