This is a mellow introduction to BeautifulSoup. The techniques shown here are pretty crude and basic, BUT hopefully they will be illustrative of how web scraping works and how you might go about tackling different websites with weird formats!

In [3]:
# Load in libraries we'll be using
import requests
from bs4 import BeautifulSoup, NavigableString, Tag
import re
from datetime import datetime
import pandas as pd

Let's do one webpage first to cross-check and explore how to access the info we want! The page I'm going to be exploring is [this one](http://69.18.170.204/archives/scripts/cgiip.exe/WService=BibSpeed/fullcit.w?xCID=130000&limit=5000&xBranch=ALL&xsdate=08/01/1940&xedate=09/01/2018&theterm=&x=0&xhomepath=&xhome=). You'll notice this is very bare-bones and not that fancy, which I thought would be good for a tutorial in some ways because you can see how to use BeautifulSoup and other things to get information from weird, strangely formatted sources.

First we'll start by using the requests library to access our webpage's html. You can read more information on requests and why you need it [here](https://www.pythonforbeginners.com/requests/using-requests-in-python).

In [4]:
# Put your URL of interest in the quotes if you'd like to look at something else
page = requests.get('http://69.18.170.204/archives/scripts/cgiip.exe/WService=BibSpeed/fullcit.w?xCID=130000&limit=5000&xBranch=ALL&xsdate=08/01/1940&xedate=09/01/2018&theterm=&x=0&xhomepath=&xhome=')

# Create a BeautifulSoup object
soup = BeautifulSoup(page.text, 'html.parser') # Note 'page' is just named that bc that's what we named it above; same
# for 'soup'. You can name the BeautifulSoup object whatever you want.

In [5]:
# It can be illustrative to look at your data directly
soup

<!-- Generated by Webspeed: http://www.webspeed.com/, http://www.possenet.org/ -->
<html>
<head>
<link href="bibspd.css" rel="STYLESHEET" type="text/css"/>
<meta content="Inmagic, Inc." name="Author"/>
<title>BiblioTech PRO V3.2b</title>
</head>
<body>
<p>[Met Performance] CID:130000<br/>New production<br/><cite>Un Ballo in Maschera {23}</cite>  Metropolitan Opera House:  12/2/1940.<br/><br/>(Opening Night {56}<br/>Edward Johnson, General Manager<br/><br/>Debuts: Alexander Sved, Mary Smith, Mstislav Dobujinsky<br/>Reviews)<br/><p><code><br/>Metropolitan Opera House<br/>December 2, 1940<br/>Opening Night  {56}<br/>New production<br/> <br/><b>Edward Johnson, General Manager</b><br/><br/><br/><b>UN BALLO IN MASCHERA </b>{23}<br/>Giuseppe Verdi--Antonio Somma<br/><br/>Amelia..................Zinka Milanov<br/>Riccardo................Jussi Björling<br/>Renato..................Alexander Sved [Debut]<br/>Ulrica..................Kerstin Thorborg<br/>Oscar...................Stella Andreva<br/>S

So now we have an understanding of what our html looks like and how it is structured. You can also explore the html using 'Inspect' in your web browser (name might change depending on what you are using -- for Chrome at least, you right click and 'Inspect' and you'll get a little window that shows the way the page is formatted. This is helpful to explore and look at when you are trying to figure out what things to select and how to isolate them.

Let's say we want to get the opera name from this page. To do this, you want to look for tags that you can use to isolate that information. For example, we can note that the opera name occurs between some cite tags because it happens to be italicized. This is how we'll isolate it.

In [6]:
# Use BeautifulSoup to find those things between a tag that says 'cite'
tag = soup.find("cite")
# Turn the contents of the tag into a string
opera = str(tag)
# Now for this particular situation, we need to clean this up. Maybe this is inefficient, but we'll use regex
opera2 = re.sub('<.*?>', '', opera)
opera3 = re.sub('{.*?}','', opera2)

# Always check of course to make sure what we are doing is working
opera3

'Un Ballo in Maschera '

Now we want to get information about the artists who performed on this date. This information is all stored in a big text block, so we'll access it by grabbing everything that occurs between < br > tags. 

**These steps will always be customized to the webpage and information you're trying to get! You have to play around a lot in the beginning just figuring out where things are and how to scrape them!**

In [9]:
# Just gets ALL the text
test = soup.get_text('<br/>', strip=False)
# This gets rid of the html formatting and prints it like would look on the website
# What it is doing is replacing the <br/>s with /ns just to make things a bit cleaner!
test = "\n".join(test.split("<br/>"))
print(test)











BiblioTech PRO V3.2b






[Met Performance] CID:130000
New production
Un Ballo in Maschera {23}
  Metropolitan Opera House:  12/2/1940.
(Opening Night {56}
Edward Johnson, General Manager
Debuts: Alexander Sved, Mary Smith, Mstislav Dobujinsky
Reviews)
Metropolitan Opera House
December 2, 1940
Opening Night  {56}
New production
 
Edward Johnson, General Manager
UN BALLO IN MASCHERA 
{23}
Giuseppe Verdi--Antonio Somma
Amelia..................Zinka Milanov
Riccardo................Jussi Björling
Renato..................Alexander Sved [Debut]
Ulrica..................Kerstin Thorborg
Oscar...................Stella Andreva
Samuel..................Norman Cordon
Tom.....................Nicola Moscona
Silvano.................George Cehanovsky
Judge...................John Carter
Servant.................Lodovico Oliviero
Dance...................Ruthanna Boris
Dance...................Monna Montes
Dance...................Lillian Moore
Dance...................Mary Smith [Debut]
Dance.....

In [10]:
# Extract the date out of the text
# REGULAR EXPRESSIONS <3
match = re.search(r'\d{2}/\d{1,2}/\d{4}', test)
# Make this an actual date
date = datetime.strptime(match.group(), '%m/%d/%Y').date()
print (date)

1940-12-02


In [11]:
# Make a function that turns a string into a list so we can more easily manipulate it
def Convert(string):
    li = list(string.split("\n"))
    return li
# What this is doing is saying whenever we see a new line marker, that's a new entry in our list!

In [12]:
# Apply this to our text
test2 = Convert(test)

In [13]:
# Get roles people are playing -- basically just getting those entries that are of interest to us. 
# This is saying: if my entry doesn't have these words, it's something I want so keep it!
roles = [k for k in test2 if '...' in k and 'Director' not in k and 'Conductor' not in k and 'Set designer' not in k and 'Costume designer' not in k and 'Dance' not in k and 'Choreographer' not in k]
# NOW this is because of our weird formatting -- what this is saying is, take advantage of the dots separating
# role and singer and split the string there (we want these things as separate bc we only want singers!)
# We'll just separate it out to roles at the moment
roles2 = [i.split('.', 1)[0] for i in roles]
roles2

['Amelia',
 'Riccardo',
 'Renato',
 'Ulrica',
 'Oscar',
 'Samuel',
 'Tom',
 'Silvano',
 'Judge',
 'Servant']

In [15]:
# Just get performers step 1
performers = [k for k in test2 if '...' in k and 'Director' not in k and 'Conductor' not in k and 'Set designer' not in k and 'Costume designer' not in k and 'Dance' not in k and 'Choreographer' not in k]
#Have to figure out a way to just get everything after the last dot... perhaps some help from here: 
# https://stackoverflow.com/questions/26665265/regex-to-find-text-after-period
# Basically get rid of character names
performers2 = [i.split('.', 1)[1] for i in performers]
performers3 = [re.sub(r'\.', '', i) for i in performers2]
performers4 = [re.sub(r' \[Debut\]', '', i) for i in performers3]
performers4


['Zinka Milanov',
 'Jussi Björling',
 'Alexander Sved',
 'Kerstin Thorborg',
 'Stella Andreva',
 'Norman Cordon',
 'Nicola Moscona',
 'George Cehanovsky',
 'John Carter',
 'Lodovico Oliviero']

In [16]:
# Now we'll get the CID, or performance id
# Extract the CID for reference
CID = [k for k in test2 if 'CID' in k]
CID2 = [re.sub(r'CID:', '', i) for i in CID]
CID3 = [re.sub(r'\[Met Performance\] ', '', i) for i in CID2]
CID3 = str(CID3)

In [17]:
# Create a data frame where we have performers connected to the date and CID and opera name
# We'll add on to this when we actually scrape more than one page!
df = pd.DataFrame({'artists':performers4})
# Add values to list
df['CID'] = CID3
df['opera'] = opera
df['date'] = date
print(df)

             artists         CID                                   opera  \
0      Zinka Milanov  ['130000']  <cite>Un Ballo in Maschera {23}</cite>   
1     Jussi Björling  ['130000']  <cite>Un Ballo in Maschera {23}</cite>   
2     Alexander Sved  ['130000']  <cite>Un Ballo in Maschera {23}</cite>   
3   Kerstin Thorborg  ['130000']  <cite>Un Ballo in Maschera {23}</cite>   
4     Stella Andreva  ['130000']  <cite>Un Ballo in Maschera {23}</cite>   
5      Norman Cordon  ['130000']  <cite>Un Ballo in Maschera {23}</cite>   
6     Nicola Moscona  ['130000']  <cite>Un Ballo in Maschera {23}</cite>   
7  George Cehanovsky  ['130000']  <cite>Un Ballo in Maschera {23}</cite>   
8        John Carter  ['130000']  <cite>Un Ballo in Maschera {23}</cite>   
9  Lodovico Oliviero  ['130000']  <cite>Un Ballo in Maschera {23}</cite>   

         date  
0  1940-12-02  
1  1940-12-02  
2  1940-12-02  
3  1940-12-02  
4  1940-12-02  
5  1940-12-02  
6  1940-12-02  
7  1940-12-02  
8  1940-12-02  
9  

Yes! There are some funky things going on in those strings. We'll deal with those below when we do this iteratively. 

### Scraping More than One Page!
There are fancier ways to do this with Selenium et al. BUT if you are lucky enough to have a site that has this structure, this might be faster / more efficient? Either way it might be useful to know :D

In [18]:
# Ok, now we have a series of commands we can use on each url to get the information that we want... 
# how do we do it iteratively?
# Iterate through URLs
pages = []
dfs = []

# When you are scraping from a page, it's good to include this information so you let people know who you are
# and let them know how to contact you in case you are causing them problems by hitting their website a bunch :O
headers = {
    'User-Agent': 'Kate Lyons, https://lyons7.github.io/',
    'From': 'k.lyons7@gmail.com'
}

# So for me, the only thing that changes in my URL is this one number which is the performance id.
# Maybe you will be this lucky and get a predictive URL like this! Maybe you won't, and then something like
# Selenium might be a better fit for your task 
for i in range(130000, 130100):
    url = 'http://69.18.170.204/archives/scripts/cgiip.exe/WService=BibSpeed/fullcit.w?xCID=' + str(i) + '&limit=5000&xBranch=ALL&xsdate=08/01/1940&xedate=09/01/2018&theterm=&x=0&xhomepath=&xhome='
    pages.append(url)
# Above what we are doing is just getting a list of URLs to scrape

# Scrape those pages:
for item in pages:
    page = requests.get(item, headers = headers)
    soup = BeautifulSoup(page.text, 'html.parser')

    # Identify opera
    tag = soup.find("cite")
    opera = str(tag)
    try:
        opera2 = re.sub('<.*?>', '', opera)
        opera3 = re.sub('{.*?}','', opera2)
        opera4 = re.sub('\[.*?\]','', opera3)
    except:
        opera = None # This is in case we are getting a performance that doesn't have an opera name (like a gala)

    # Turn into list for easier processing
    test = soup.get_text('<br/>', strip=False)
    test = "\n".join(test.split("<br/>")) # Same stuff as we did above

    # Get dates
    # Have to do it this way in case you don't find a date for an entry
    try:
        match = re.search(r'\d{2}/\d{1,2}/\d{4}', test)
        date = datetime.strptime(match.group(), '%m/%d/%Y').date()
    except:
        match = None # In case this thing doesn't have a date (this will ensure your loop keeps running and going thru)

    # Make list
    def Convert(string):
        li = list(string.split("\n"))
        return li
    test2 = Convert(test)

    # roles
    roles = [k for k in test2 if '...' in k and 'Director' not in k and 'Conductor' not in k and 'Set designer' not in k and 'Costume designer' not in k and 'Dance' not in k and 'Choreographer' not in k and 'Production' not in k and 'Lighting designer' not in k and 'Choreography' not in k and 'Designer' not in k and 'Projection Designer' not in k and 'Associate Designer' not in k and 'Dramaturg' not in k and 'Harpsichord' not in k]
    roles2 = [i.split('..', 1)[0] for i in roles] # Get roles! Same thing as before...

    # artists
    performers = [k for k in test2 if '...' in k and 'Director' not in k and 'Conductor' not in k and 'Set designer' not in k and 'Costume designer' not in k and 'Dance' not in k and 'Choreographer' not in k and 'Production' not in k and 'Lighting designer' not in k and 'Choreography' not in k and 'Designer' not in k and 'Projection Designer' not in k and 'Associate Designer' not in k and 'Dramaturg' not in k and 'Harpsichord' not in k]
    performers2 = [i.split('..', 1)[1] for i in performers]
    performers3 = [re.sub(r'\.', '', i) for i in performers2]
    # More cleaning to do: get rid of this extra information that we don't need atm
    performers4 = [re.sub(r' \[Debut\]', '', i) for i in performers3]
    performers5 = [re.sub(r' \[First appearance\]', '', i) for i in performers4] 
    performers6 = [re.sub(r' \[Last performance\]', '', i) for i in performers5]
    performers7 = [re.sub(r' \[Last appearance\]', '', i) for i in performers6]


    # Performance id
    CID = [k for k in test2 if 'CID' in k]
    CID2 = [re.sub(r'CID:', '', i) for i in CID]
    # CID3 = [re.sub(r'\[Met Performance\] ', '', i) for i in CID2]
    # CID4 = [re.sub(r'\[Met Concert/Gala\] ', '', i) for i in CID3]
    # CID5 = [re.sub(r'\[Met Presentation\] ', '', i) for i in CID4]
    CID6 = str(CID2)


    # Put these in a temporary dataframe
    tempdf = pd.DataFrame({'role':roles2})
    tempdf['artist'] = performers5
    tempdf['opera'] = opera4
    tempdf['date'] = date
    tempdf['CID'] = CID6

    dfs.append(tempdf)
   
# Create a master data frame when we are all finished!
masterDF = pd.concat(dfs, ignore_index=True)
masterDF


# Get rid of performances we aren't interested in like galas and summer concerts
masterDF3 = masterDF[masterDF['CID'].str.contains("Performance")]
masterDF3

Unnamed: 0,role,artist,opera,date,CID
0,Amelia,Zinka Milanov,Un Ballo in Maschera,1940-12-02,['[Met Performance] 130000']
1,Riccardo,Jussi Björling,Un Ballo in Maschera,1940-12-02,['[Met Performance] 130000']
2,Renato,Alexander Sved,Un Ballo in Maschera,1940-12-02,['[Met Performance] 130000']
3,Ulrica,Kerstin Thorborg,Un Ballo in Maschera,1940-12-02,['[Met Performance] 130000']
4,Oscar,Stella Andreva,Un Ballo in Maschera,1940-12-02,['[Met Performance] 130000']
5,Samuel,Norman Cordon,Un Ballo in Maschera,1940-12-02,['[Met Performance] 130000']
6,Tom,Nicola Moscona,Un Ballo in Maschera,1940-12-02,['[Met Performance] 130000']
7,Silvano,George Cehanovsky,Un Ballo in Maschera,1940-12-02,['[Met Performance] 130000']
8,Judge,John Carter,Un Ballo in Maschera,1940-12-02,['[Met Performance] 130000']
9,Servant,Lodovico Oliviero,Un Ballo in Maschera,1940-12-02,['[Met Performance] 130000']


Here's another example from another website where I had to do things a little differently. Thought I would include it to show another specific html context and how to navigate it.

Now we will get opera aria information. I needed this because for each opera I also wanted to get composer info, language info and also the arias themselves for each opera so I could look up what the top 5 most popular ones were on Spotify!

Get data on pieces and operas themselves ([website](http://www.opera-arias.com/scenes/&page=1#x)). A lot of help from [here](https://www.digitalocean.com/community/tutorials/how-to-scrape-web-pages-with-beautiful-soup-and-python-3) and [here](https://www.dataquest.io/blog/web-scraping-beautifulsoup/)!


In [19]:
# Collect and parse first page
# This website is a great database of opera scenes! 
page = requests.get('http://www.opera-arias.com/scenes/&page=1#x')
# And look ^ -- really helpful URL formatting where we just have page numbers wooo!

# Turn this in to a BS object
soup = BeautifulSoup(page.text, 'html.parser')

In [20]:
# We are dealing with a table format (see webpage) -- we have even rows and uneven rows and will 
# attack them one by one
uneven_containers = soup.find_all('div', class_ = 'tr_uneven')
even_containers = soup.find_all('div', class_ = 'tr_even')

In [29]:
# Opera id first
# What this start is doing is saying move thru the uneven_containers list and get stuff from there (things are tiered!)
# We will do this again for the even_containers
# This current run will look rows 1, 3, 5, 7 ... etc.
start = uneven_containers[0]

# This will give us the scene_id  
scene_id = start.find('span', class_='span_s s_id') # Again, found this thru rooting around 
scene_id = scene_id.text
scene_id

'1'

In [30]:
# Scene name
scene_name = start.a.text
scene_name

'A bas les épées'

In [31]:
# Opera
scene_opera = start.find('span', class_='span_s s_opera')
scene_opera = scene_opera.text # This is a bit different than above -- we actually have a container for each thing
# we are interested in and can grab the text from it by using .text
# Again -- how you get stuff will depend on your webpage!
scene_opera

'Otello'

In [34]:
# Composer
scene_composer = start.find('span', class_='span_s s_composer')
scene_composer = scene_composer.text
print(scene_composer)

# Act
scene_act = start.find('span', class_='span_s s_act')
scene_act = scene_act.text
print(scene_act)

# Type
scene_type = start.find('span', class_='span_s s_type')
scene_type = scene_type.text
print(scene_type)

# Voice
scene_voice = start.find('span', class_='span_s s_voice')
scene_voice = scene_voice.text
print(scene_voice)

# Language
scene_language = start.find('span', class_='span_s s_lang')
scene_language = scene_language.text
print(scene_language)

# Role
scene_role = start.find('span', class_='span_s s_role')
scene_role = scene_role.text
print(scene_role)

Verdi
1.08-1 
recitative
T Br
Italian
Otello/Iago


In [35]:
# ITERATE
# Blatantly stolen from here: https://www.dataquest.io/blog/web-scraping-beautifulsoup/

# Setting lists to store data in as you go thru
pages = []
scene_id = []
scene_name = []
opera = []
composer = []
act = []
scene_type = []
voice = []
language = []
role = []

# Let people know who you are so you don't freak them out
dfs = []
headers = {
    'User-Agent': 'Kate Lyons, https://lyons7.github.io/',
    'From': 'k.lyons7@gmail.com'
}

# URL is http://www.opera-arias.com/scenes/&page=1#x

# To update the page numbers in the crawl
for i in range(1, 15): # More than this, but we'll do a smaller number to illustrate
    url = 'http://www.opera-arias.com/scenes/&page=' + str(i) + '#x'
    pages.append(url)


for item in pages:
    page = requests.get(item, headers = headers)

    # Parse the content of the request with BeautifulSoup
    soup = BeautifulSoup(page.text, 'html.parser')

    # Select the thing you are you interested in -- let's do uneven ones first
    for t in range(0, 15): # This is specific to my page -- I have 30 rows per page, so just need to have 15 list entries
        # If I went over this I'd have an error so you need to specify (it needs to know how far to iterate)!
        uneven_containers = soup.find_all('div', class_ = 'tr_uneven')
        start = uneven_containers[t]

        # Scene name
        scene_name1 = start.a.text
        scene_name.append(scene_name1)

        #Opera
        scene_opera1 = start.find('span', class_='span_s s_opera')
        scene_opera2 = scene_opera1.text
        opera.append(scene_opera2)

        # Composer
        scene_composer1 = start.find('span', class_='span_s s_composer')
        scene_composer2 = scene_composer1.text
        composer.append(scene_composer2)

        # Act
        scene_act1 = start.find('span', class_='span_s s_act')
        scene_act2 = scene_act1.text
        act.append(scene_act2)

        # Type
        scene_type1 = start.find('span', class_='span_s s_type')
        scene_type2 = scene_type1.text
        scene_type.append(scene_type2)

        # Voice
        scene_voice1 = start.find('span', class_='span_s s_voice')
        scene_voice2 = scene_voice1.text
        voice.append(scene_voice2)

        # Language
        scene_language1 = start.find('span', class_='span_s s_lang')
        scene_language2 = scene_language1.text
        language.append(scene_language2)

        # Role
        scene_role1 = start.find('span', class_='span_s s_role')
        scene_role2 = scene_role1.text
        role.append(scene_role2)

    # Start again with evens
    # Select the thing you are you interested in -- let's do even ones now
    for k in range(0, 15):
        even_containers = soup.find_all('div', class_ = 'tr_even')
        start2 = even_containers[k]

        # Scene name
        scene_name2 = start2.a.text
        scene_name.append(scene_name2)

        #Opera
        scene_opera3 = start2.find('span', class_='span_s s_opera')
        scene_opera4 = scene_opera3.text
        opera.append(scene_opera4)

        # Composer
        scene_composer3 = start2.find('span', class_='span_s s_composer')
        scene_composer4 = scene_composer3.text
        composer.append(scene_composer4)

        # Act
        scene_act3 = start2.find('span', class_='span_s s_act')
        scene_act4 = scene_act3.text
        act.append(scene_act4)

        # Type
        scene_type3 = start2.find('span', class_='span_s s_type')
        scene_type4 = scene_type3.text
        scene_type.append(scene_type4)

        # Voice
        scene_voice3 = start2.find('span', class_='span_s s_voice')
        scene_voice4 = scene_voice3.text
        voice.append(scene_voice4)

        # Language
        scene_language3 = start2.find('span', class_='span_s s_lang')
        scene_language4 = scene_language3.text
        language.append(scene_language4)

        # Role
        scene_role3 = start2.find('span', class_='span_s s_role')
        scene_role4 = scene_role3.text
        role.append(scene_role4)


# Now we want to put everything everything together
opera_info = pd.DataFrame({'scene_name': scene_name,
                           'opera': opera2,
                           'composer': composer,
                           'act': act,
                           'scene_type': scene_type,
                           'voice': voice,
                           'language': language,
                           'role': role})
opera_info

Unnamed: 0,scene_name,opera,composer,act,scene_type,voice,language,role
0,A bas les épées,,Verdi,1.08-1,recitative,T Br,Italian,Otello/Iago
1,A bitter thought,,Verdi,2.15,duet,S T,French,Hélène/Gaston
2,A brief moment of life,,Verdi,4.06,recitative,B S T,Italian,Pagano (Hermit)/Giselda/Arvino
3,A ce mot tout s'anime,,Meyerbeer,2,aria,soprano,French,Marguerite
4,A cette voix quel trouble,,Bizet,1.10,"recitative,aria",tenor,French,Nadir
5,A che smarriti e pallidi,,Verdi,2.04,quartet,B T Br B,Italian,Federico/Arrigo/Rolando/Mayor
6,A chi serena io giro,,Mozart,1.17,aria,soprano,Italian,Fortuna
7,A consolarmi affrettisi,,Donizetti,1.17,duet,T S,Italian,Carlo/Linda
8,A costoro quei nume perdoni,,Verdi,0.03,recitative,B T T,Italian,Alvaro/Otumbo/Zamoro/Indians
9,A deux cuartos,,Bizet,3.11,choir,,French,Street Sellers


There you go!