## Collecting Data from the Web

In [1]:
import pandas as pd

When dealing with text in data or code, we often want to represent special characters that would otherwise have an unintended effect in our code.
* For example, we might want a double quote `"` or a single quote `'` inside a string without it ending the string
* Similarly with carriage returns/newlines

A lot of this is built into many programming/coding languages, including Python

For example think about writing the following HTML string in Python:

In [2]:
stringIn="<h2>Hello</h2>\n\n\t<div id=\"main\">Main text inside a content divider</div>\n\n<footer style=\"color:#003594;background-color:#FFB81C\"><center>Bottom footer text in Blue</center></footer>\\"

When this is printed or written to a data file though, this same string would look like this:

In [3]:
print(stringIn)

<h2>Hello</h2>

	<div id="main">Main text inside a content divider</div>

<footer style="color:#003594;background-color:#FFB81C"><center>Bottom footer text in Blue</center></footer>\


Here I'm using the code: 
* `\n` in place of a newline
* `\t` in place of a Tab
* `\"` in place of a double quote

Because `\` is used here as an escape character if we want to use the **actual** `\` character in a string we need to type `\\`
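
As a small sketch of the point above (the variable names here are just for illustration), each escape sequence is typed as two characters but stored as one:

```python
# Escape sequences are typed as two characters but stored as one.
tab_and_newline = "a\tb\n"
print(len(tab_and_newline))   # 4 characters: a, tab, b, newline

# A doubled backslash stores a single literal backslash.
one_backslash = "\\"
print(len(one_backslash))     # 1

# repr() shows the escaped form; print() shows the characters themselves.
quoted = "She said \"hi\""
print(repr(quoted))
print(quoted)
```

This is the same difference we see between evaluating a string in a cell and calling `print` on it.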

As an aside, whenever you use strings in Python you do need to be careful with certain characters, which is why we compile the expressions below before using them.

Because the backslash character is also special within the "regular expressions" we will talk about below, we need to make sure that when we pass a pattern to the `re` module it sees the same thing we do.
* Any time `\` appears in a string, it tells Python that the next character is something special.
* If you actually want a literal `\`, you need to escape it as `\\` within a Python string, which Python stores as a single backslash. A regular expression treats `\` as special too, so a pattern that looks for a literal backslash has to be typed as `\\\\` in an ordinary Python string.
* Some useful escape characters are: 
    - `\'` is '
    - `\"` is " 
    
*Note: if you look at the Markdown for this cell, I wasn't able to write these strings directly unless they were in the code quotes!*
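
A minimal sketch of the double escaping described above, using a made-up Windows-style path as the text to search. The raw-string form (`r"..."`) is an alternative way to type the same pattern:

```python
import re

path = "C:\\temp\\data.csv"   # the stored text is C:\temp\data.csv

# Ordinary string: four typed backslashes are stored as two,
# which the regex engine reads as "one literal backslash".
ordinary = re.findall("\\\\", path)

# Raw string: what we type is exactly what the regex engine sees.
raw = re.findall(r"\\", path)

print(ordinary == raw, len(raw))
```

Both calls find the two backslashes in the path.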

The same string can also be interpreted as HTML. In the cell below I do this using a *magic*: a command that tells IPython (which runs this notebook) not to run the cell as Python, but to render its contents as HTML instead.

HTML is another markup language with its own special characters; here, anything inside a tag: `<  >`

In [4]:
%%HTML 
<h2>Hello</h2>

	<div id="main">Main text inside a content divider</div>

<footer style="color:#003594;background-color:#FFB81C"><center>Bottom footer text in Blue</center></footer>

While we can write pure HTML and style it, websites will often use something called a cascading style sheet (CSS). This is a list of rules about how to format HTML objects with certain properties. Having a systematic styling scheme often means that websites will give elements of the HTML special identifiers like the `id` in the `<div>` tag above. This structure can be useful when we want to extract data.

However, when we call the string in Python (without printing it) it is simply the characters we entered:

In [5]:
stringIn

'<h2>Hello</h2>\n\n\t<div id="main">Main text inside a content divider</div>\n\n<footer style="color:#003594;background-color:#FFB81C"><center>Bottom footer text in Blue</center></footer>\\'

## Search Patterns

Escape characters are not only a way to enter special characters; they are also used in flexible search patterns called **regular expressions**. These highly structured patterns can then be used to look for and extract data.

**Proviso:** I am not an expert at using regular expressions. I know they exist, and I know how to use them *a little*. Almost every time I need a complex regular expression for a special purpose, a little googling finds me the answer.

While I can do basic regular expressions on the go, when I'm doing something more complicated I mostly go to [this website](https://regexr.com/) to check/build the regular expression!
* That way I can paste in a sample of the data I'm using and see what the regular expression will pick up!
* These regular expressions are supplemented with additional structure that comes from a language called Perl, which allows for even finer detail.

We'll use the following string as the text to search:

In [6]:
searchString="My office number used to be (412) 383 8152, if you wanted to call my office, but you now can only send an email to alistair@pitt.edu to contact me.\n\n The main number for the MQE is +1-412-383-5425. I think Randy's office number is 412.648.1737. \n\nThe department has a twitter account at @PittEcon (look for the #PittMQE tag). Sometimes email addresses are like George's gl20@andrew.cmu.edu.\n"
print(searchString)

My office number used to be (412) 383 8152, if you wanted to call my office, but you now can only send an email to alistair@pitt.edu to contact me.

 The main number for the MQE is +1-412-383-5425. I think Randy's office number is 412.648.1737. 

The department has a twitter account at @PittEcon (look for the #PittMQE tag). Sometimes email addresses are like George's gl20@andrew.cmu.edu.



And we'll check what our patterns return by using the `re` library in Python:

In [7]:
# import regular expressions library
import re # part of Python's standard library (the matching engine is implemented in C)

## Some examples:

* `(?:\s)?([\w\.-]+@[\w\.-]+\.\w{2,4})(?:\s\.)?`
    - `(?: )` : represents a group we don't want returned, but which is necessary to the search
    - `(  )` : without the `?:` means a group to return in the output
    - `\s` means a whitespace character (a space, tab or newline)
    - After any character or group:
        - `?` means 0 or 1 occurrences
        - `+` means 1 or more occurrences
        - `*` means 0 or more.
    - `[\w\.-]+` : means word characters, dots (plain dots, so escaped) or a dash, repeated one or more times (the `+` modifier)
    - `@` has no special meaning here, it just means we require the "@" character in the pattern
    - `\w{2,4}` : means a word character, appearing two to four times (so "com", "gov", "ca").

So this will hopefully match email addresses.

In [10]:
print(searchString)
emailFinder=re.compile(r"(?:\s)?([\w\.-]+@[\w\.-]+\.\w{2,4})(?:\s\.)?", re.S ) # raw string, so Python passes the backslashes through untouched
re.findall( emailFinder,searchString )

My office number used to be (412) 383 8152, if you wanted to call my office, but you now can only send an email to alistair@pitt.edu to contact me.

 The main number for the MQE is +1-412-383-5425. I think Randy's office number is 412.648.1737. 

The department has a twitter account at @PittEcon (look for the #PittMQE tag). Sometimes email addresses are like George's gl20@andrew.cmu.edu.




['alistair@pitt.edu', 'gl20@andrew.cmu.edu']
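
To see how the group types change what `findall` hands back, here is a small sketch on a made-up sample string (the text and variable names are hypothetical, not part of the data above):

```python
import re

sample = "id: 42, id: 7"   # hypothetical sample text

# Non-capturing group: "id: " must be present but is not returned.
digits_only = re.findall(r"(?:id: )(\d+)", sample)

# Two capturing groups: findall returns one tuple per match.
both_groups = re.findall(r"(id): (\d+)", sample)

# No groups at all: findall returns the whole match.
whole_match = re.findall(r"id: \d+", sample)

print(digits_only)
print(both_groups)
print(whole_match)
```

With one capturing group we get a list of strings, with several we get a list of tuples, and with none we get the full matched text.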

Twitter:
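* `(@[A-Za-z]+)|(#[A-Za-z]+)`
    - `@` : The @ character
    - `[A-Za-z]+` : one or more characters A-Z or a-z (the shorter `[A-z]` would also match the punctuation that sits between `Z` and `a` in ASCII, such as `[` and `_`)
    - `|` : Boolean OR
    - `#` : the hash symbol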

So this will match twitter user names (the first part) **or** hashtags (the second part)

In [11]:
print(searchString)
tweetFinder=re.compile(r'(?:\s)(@[A-Za-z]+)|(#[A-Za-z]+)', re.S ) # raw string; [A-Za-z] matches letters only
re.findall( tweetFinder,searchString )

My office number used to be (412) 383 8152, if you wanted to call my office, but you now can only send an email to alistair@pitt.edu to contact me.

 The main number for the MQE is +1-412-383-5425. I think Randy's office number is 412.648.1737. 

The department has a twitter account at @PittEcon (look for the #PittMQE tag). Sometimes email addresses are like George's gl20@andrew.cmu.edu.




[('@PittEcon', ''), ('', '#PittMQE')]

What would we need to do to get rid of the entries we don't want?
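
One possible answer, sketched on a shortened stand-in for the sample text. Both options are my own suggestions rather than the only way to do it; the `(?<!\w)` lookbehind in the first is an addition that stops the `@pitt` inside an email address from matching:

```python
import re

# Shortened, hypothetical stand-in for the sample text
text = "Email alistair@pitt.edu or follow @PittEcon (#PittMQE)."

# Option 1: a single capturing group, so findall returns plain strings.
single_group = re.findall(r"(?<!\w)([@#][A-Za-z]+)", text)

# Option 2: keep the two-group pattern and drop the empty slot in each tuple.
pairs = re.findall(r"(?:\s)(@[A-Za-z]+)|(#[A-Za-z]+)", text)
flattened = [user or tag for user, tag in pairs]

print(single_group)
print(flattened)
```

Option 2 works because exactly one of the two groups is non-empty in every match, so `user or tag` picks whichever one was found.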

You can also extract separated groups:
* `(?:\d{1}\s)?\(?(\d{3})\)?-?\s?.?(\d{3})-?\s?.?(\d{4})`
    - `(?: )` this allows this to be there, but doesn't retain it in the capture (so a one-digit country code followed by a space)
    - `?` when outside the parentheses this means either 0 or 1 repetition of the prior block
    - `\(?` : so either an open parenthesis or not
    - `(  )`: when the parentheses aren't escaped with `\` or put in a non captured group `(?:  )`, this is saying we want to get this in the output
    - `\d{3}` : exactly three digits 0-9
    - `\)?` : 0 or 1 closing parenthesis
    - `-?`  , `\s?`, `.?`: possibly a dash, possibly a space, possibly any one other character (the `.` is unescaped, so it matches any separator, including a literal dot)
    - `(\d{3})` : three more digits, capture them
    - `(\d{4})` : four digits, capture them

In [15]:
print(searchString)
phoneFinder=re.compile(r"(?:\d{1}\s)?\(?(\d{3})\)?-?\s?.?(\d{3})-?\s?.?(\d{4})") # raw string, same pattern as described above
re.findall(phoneFinder,searchString)

My office number used to be (412) 383 8152, if you wanted to call my office, but you now can only send an email to alistair@pitt.edu to contact me.

 The main number for the MQE is +1-412-383-5425. I think Randy's office number is 412.648.1737. 

The department has a twitter account at @PittEcon (look for the #PittMQE tag). Sometimes email addresses are like George's gl20@andrew.cmu.edu.




[('412', '383', '8152'), ('412', '383', '5425'), ('412', '648', '1737')]

Why did we not find the last number? How can we fix it?
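
Each match comes back as a tuple of the captured groups, so the pieces can be reassembled into one consistent format. A rough sketch follows, using a simplified pattern of my own (an assumption, not the pattern above) in which the separators are listed explicitly as `[-.\s]?` instead of relying on the catch-all `.?`:

```python
import re

# Simplified, hypothetical variant: optional parentheses, then an optional
# dash, dot or whitespace between the three blocks of digits.
phone_pattern = re.compile(r"\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})")

numbers = "(412) 383 8152, +1-412-383-5425, 412.648.1737"

# Join the three captured groups of every match with dashes.
normalized = ["-".join(groups) for groups in phone_pattern.findall(numbers)]
print(normalized)
```

Listing the allowed separators makes the pattern a little stricter, since an unescaped `.` would accept any character at all between the digit blocks.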

In [13]:
searchString

"My office number used to be (412) 383 8152, if you wanted to call my office, but you now can only send an email to alistair@pitt.edu to contact me.\n\n The main number for the MQE is +1-412-383-5425. I think Randy's office number is 412.648.1737. \n\nThe department has a twitter account at @PittEcon (look for the #PittMQE tag). Sometimes email addresses are like George's gl20@andrew.cmu.edu.\n"

## Looking at HTML

Consider the HTML code:
```
    <table>
         <tr><td type="normal"><b>Column 1</b></td><td><b>Column 2</b></td> </tr>
        <tr><td type="normal">Entry 1</td><td>Entry 2</td> </tr>
        <tr><td type="other">Entry 3</td><td>Entry 4 </td></tr>
    </table>
```

It creates a simple structured table, and looks like this when rendered: (again using a magic, and without any fancy table styling commands)

In [16]:
%%HTML
<table>
    <tr><td type="normal"><b>Column 1</b></td><td><b>Column 2</b></td> </tr>
    <tr><td type="normal">Entry 1</td><td>Entry 2</td> </tr>
    <tr><td type="other">Entry 3</td><td>Entry 4 </td></tr>
</table>

| **Column 1** | **Column 2** |
|---|---|
| Entry 1 | Entry 2 |
| Entry 3 | Entry 4 |


So if we wanted to try and get at some of the table data here:

* `(?:<td.*?>)(.*?)(?:<\/td>)`
    - So this asks us to match but not keep a `<td` string first, followed by `.*?>`
        - `.` means any character (as there may very well be a lot of styling information within the `<td>` tag)
        - `*` means we repeat the any character any number of times
        - but `*?` means as few as possible, so we stop at the first closing `>` we encounter
    - `(.*?)` is our main data capture group, and so this will capture everything between the opening `<td>` tag and the closing `</td>` tag (note `*?` specifies as few as possible again)
    - `(?:<\/td>)` : says don't capture the closing tag `</td>`, but that it's necessary

When we apply our regular expression it will hopefully extract six entries:
* `<b>Column 1</b>`
* `<b>Column 2</b>`
* `Entry 1`
* `Entry 2`
* `Entry 3`
* `Entry 4`

In [17]:
tableString='<table>\n\t<tr><td type="normal"><b>Column 1</b></td><td><b>Column 2</b></td> </tr>\n\t<tr><td type="normal">Entry 1</td><td>Entry 2</td> </tr>\n\t<tr><td type="other">Entry 3</td><td>Entry 4 </td></tr>\n</table>'
tdFinder=re.compile(r'(?:<td.*?>)(.*?)(?:<\/td>)') # raw string, so the backslash reaches the regex engine untouched
pd.DataFrame({"found": re.findall(tdFinder,tableString)}) # I just add the list to a dict to label the column in the dataframe


| | found |
|---|---|
| 0 | `<b>Column 1</b>` |
| 1 | `<b>Column 2</b>` |
| 2 | Entry 1 |
| 3 | Entry 2 |
| 4 | Entry 3 |
| 5 | Entry 4 |


Often times in web data, we can use the styling information to pick out specific types of data entries.

For example, suppose that we only wanted to extract the `normal` type `td` entries?

In [18]:
tdFinder=re.compile(r'(?:<td.*?type="normal".*?>)(.*?)(?:<\/td>)') # raw string; the double quotes need no escaping inside single quotes
re.findall(tdFinder,tableString)


['<b>Column 1</b>', 'Entry 1']

Regular expressions used to be the backbone of how you'd search a website to extract the information you wanted.

And they're still very useful to know about when looking for patterns in text, particularly if you don't have more structure, so let's go through how to extract web data with some regular expressions.

We can run regular expressions on web data within Python fairly easily, and we've already played with the `re` library a bit above. Let's load `requests` so we can fetch web pages.

In [1]:
import requests # This loads the library that gives us the `get` method

Let's look at the course [home page](https://www.mqe.pitt.edu/courses) for the MQE. When you're building these expressions, I always find it useful to load the web page in a browser and open the element inspector!

In [2]:
response=requests.get('https://www.mqe.pitt.edu/economics-data-science-communication-courses')
response

<Response [200]>

In [3]:
WebsiteHTML=response.text

Now we load the response as raw text to look through

In [4]:
WebsiteHTML=response.text
print(WebsiteHTML[500:1000]) # show characters 500 to 999 of the page source

quiv="X-UA-Compatible" content="IE=edge" />
  <meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="msapplication-TileColor" content="#2b5797"/>
<meta name="msapplication-config" content="/sites/default/files/favicons/browserconfig.xml"/>
<meta name="theme-color" content="#1c2957"/>
<meta name="description" content="A Masters in Economics will equip you with skills in: Economic Modeling and Intuition Data Scie


When we write regular expressions in Python, it helps to tell Python to treat the pattern as a raw string by prefacing it with `r`. That way backslashes are passed through to `re` untouched, and we don't have to double-escape the special characters we want to use.

In [24]:
print('This is not a \n\traw string so the new lines aren\'t shown')
print(r'This is a \n\traw string so the new lines aren\'t shown')

This is not a 
	raw string so the new lines aren't shown
This is a \n\traw string so the new lines aren\'t shown


It's possible to pass a regex string directly to functions like `re.findall` without compiling it first (writing it as a raw string just keeps the backslashes readable). However, if you're going to use a regex many times you will be better off compiling it.
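
A quick sketch of the two calling styles on a made-up string (the names are just for illustration). The third call types the same pattern without the `r` prefix:

```python
import re

text = "a1 b22 c333"   # hypothetical sample text

# Pass the pattern string straight to the module-level function...
direct = re.findall(r"\d+", text)

# ...or compile it once and reuse the compiled object.
digits = re.compile(r"\d+")
compiled = digits.findall(text)

# Without the r prefix the backslash has to be doubled by hand.
not_raw = re.findall("\\d+", text)

print(direct, compiled, not_raw)
```

All three calls return the same list of matches; the choice is about readability and reuse rather than about what gets matched.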

We'll use the `re` module to look through the data again, where we'll add three additional flags to the search:
* `re.M` : multi-line, so `^` and `$` match at the start and end of every line rather than only at the start and end of the whole string
* `re.I` : ignore case for characters, so A and a will be treated the same
* `re.S` : this flag means that the special character `.`, which otherwise matches everything except a newline, also matches newlines
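
A small sketch of what each flag changes, run on a made-up two-line string (the text and variable names are hypothetical):

```python
import re

text = "first line\nSecond LINE"

# re.M: ^ also matches at the start of every line
no_m = re.findall(r"^\w+", text)
with_m = re.findall(r"^\w+", text, re.M)

# re.I: upper and lower case are treated the same
no_i = re.findall(r"line", text)
with_i = re.findall(r"line", text, re.I)

# re.S: the dot is allowed to cross the newline
no_s = re.findall(r"first.*LINE", text)
with_s = re.findall(r"first.*LINE", text, re.S)

print(no_m, with_m)
print(no_i, with_i)
print(no_s, with_s)
```

Flags are combined with `|`, as in `re.M|re.I|re.S`.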

In [26]:
matchTableData = re.findall(r'(?:<td.*?>)(.*?)(?:<\/td>)', WebsiteHTML, re.M|re.I|re.S)
print(len(matchTableData))
type(matchTableData)

11


list

In [27]:
print(matchTableData[0])
print(matchTableData[1])

<a href="#firms-markets">Individuals, Firms, and Markets</a>
<a href="#quant-methods">Quantitative Methods</a>


Define a function that strips out anything matching a pattern search, where we'll use a search for matching tags to strip out all the HTML.

In [28]:
tagMatch = re.compile('<.*?>', re.S) # lazily match anything that looks like a tag
def stripText(text,match=tagMatch,replace=''): 
    # define a function to remove any HTML tags
    return re.sub(match, replace, text )

In [29]:
print(tableString)

<table>
	<tr><td type="normal"><b>Column 1</b></td><td><b>Column 2</b></td> </tr>
	<tr><td type="normal">Entry 1</td><td>Entry 2</td> </tr>
	<tr><td type="other">Entry 3</td><td>Entry 4 </td></tr>
</table>


In [30]:
print(stripText(tableString,tagMatch))


	Column 1Column 2 
	Entry 1Entry 2 
	Entry 3Entry 4 



Or being slightly fancier:

In [31]:
whitespaceMatch = re.compile(r'\s+', re.S) # one or more whitespace characters (raw string)
print(stripText(stripText(tableString,tagMatch," ") ,whitespaceMatch," "))

 Column 1 Column 2 Entry 1 Entry 2 Entry 3 Entry 4 



In [32]:
matchTableDataList=[]
for item in matchTableData: # for each entry in our table search
    # Here I just extract hyperlink tags that look like <a href="#link-location">Label</a>. These point to internal links.
    tupleIn=re.findall(r'(?:<a.*?)href=\"#?(.*?)\"(?:.*?)>(.*?)(?:<\/a>)', item, re.M|re.I|re.S)
    for links in tupleIn: # now I look at each of the entries I found in this match
        # Here I construct a new regular expression using the relative link location I found
        # (re.escape makes sure any special characters in the link are matched literally)
        # This looks for the H4 tags which contain the targets for the link
        # I then move into the <p> tag below it to get the course description.
        descFinder=re.compile(r'(?:<h4.*?>.*?<a\s.*?id=\"'+re.escape(links[0])+ r'\".*?>.*?<\/h4>.*?<p.*?>)(.*?)(?:<\/p>)',re.I|re.S)
        desc=re.findall( descFinder, WebsiteHTML)
        # I then put this into a dict so that when I convert this to a dataframe it will already be set up correctly
        matchTableDataList.append({"location":links[0], "label": links[1],"description":stripText(desc[0])})
# Convert to dataframe
courseInformation=pd.DataFrame(matchTableDataList)

In [33]:
courseInformation

| | location | label | description |
|---|---|---|---|
| 0 | firms-markets | Individuals, Firms, and Markets | Individuals, Firms, and Markets provide a rigo... |
| 1 | quant-methods | Quantitative Methods | Quantitative Methods present a framework for d... |
| 2 | comm-econ-insights | Communicating Economic Insights | Communicating Economic Insights helps students... |
| 3 | econ-interference | Economic Inference from Data | Economic Inference from Data provides hands-on... |
| 4 | econ-analysis-tech | Applications of Economic Analysis Techniques | Applications of Economic Analysis Techniques m... |
| 5 | data-design | Data Design for Economic Applications (Capstone) | Data Design for Economic Applications (Capston... |
| 6 | evidence-based | Evidence-Based Analysis in Labor, Public and H... | Evidence-Based Analysis in Labor, Public, and ... |
| 7 | big-data | Big Data and Forecasting in Economics | Big Data and Forecasting in Economics covers c... |
| 8 | incentives-info | Incentives and Information | Incentives and Information are central to mode... |


Export to a csv

In [None]:
courseInformation.to_csv('CourseInf.csv')

And so with only some knowledge of how the web page is constructed, I've extracted the course listing across multiple locations from the MQE website!

In [34]:
print(courseInformation['label'][6])
print(courseInformation['description'][6])

Evidence-Based Analysis in Labor, Public and Health Economics
Evidence-Based Analysis in Labor, Public, and Health Economics allows students to further develop their MQE toolkits through exposure to both seminal and frontier applied research on a diverse set of topics such as education, environmental sustainability, the non-profit sector, and employment compensation. In addition to reviewing extant applied research, students will hone their own analytical approach working both individually and in groups to apply economic thinking and analysis to a broad set of business and policy problems.


## Other things to look for:

We can look for table rows...

In [35]:
matchTableRow = re.findall(r'<tr.*?>(.*?)<\/tr>', WebsiteHTML, re.M|re.I|re.S)
matchTableRow[0]

'\n\t\t\t<th colspan="1" rowspan="2" scope="row" width="5%">Fall</th>\n\t\t\t<th scope="row" width="12%">Session 1</th>\n\t\t\t<td width="34%"><a href="#firms-markets">Individuals, Firms, and Markets</a></td>\n\t\t\t<td width="27%"><a href="#quant-methods">Quantitative Methods</a></td>\n\t\t\t<td rowspan="2" width="27%"><a href="#comm-econ-insights">Communicating Economic Insights</a></td>\n\t\t'

And strip out some of the white space either side...

In [36]:
matchTableRow[0].strip()

'<th colspan="1" rowspan="2" scope="row" width="5%">Fall</th>\n\t\t\t<th scope="row" width="12%">Session 1</th>\n\t\t\t<td width="34%"><a href="#firms-markets">Individuals, Firms, and Markets</a></td>\n\t\t\t<td width="27%"><a href="#quant-methods">Quantitative Methods</a></td>\n\t\t\t<td rowspan="2" width="27%"><a href="#comm-econ-insights">Communicating Economic Insights</a></td>'

Or remove both tags and white space.

In [37]:
stripText(stripText(matchTableRow[0] ,tagMatch," "), whitespaceMatch," ")

' Fall Session 1 Individuals, Firms, and Markets Quantitative Methods Communicating Economic Insights '

We can find the table headers

In [38]:
matchTableHeader = re.findall(r'<th.*?>(.*?)<\/th>', WebsiteHTML, re.M|re.I|re.S)
matchTableHeader

['Fall', 'Session 1', 'Session 2', 'Spring', 'Session 1', 'Session 2']

In [39]:
match_list=[]
for i in range(0,len(matchTableRow)):
    match_list.append( re.findall(r'<td.*?>.*<a.*>(.*?)<\/a>.*<\/td>', matchTableRow[i], re.M|re.I|re.S) )
pd.DataFrame(match_list)

| | 0 |
|---|---|
| 0 | Communicating Economic Insights |
| 1 | Economic Inference from Data |
| 2 | Data Design for Economic Applications (Capstone) |
| 3 | Big Data and Forecasting in Economics |
| 4 | Incentives and Information |
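
Only one link per row comes back here because `.*` is greedy: it grabs as much as it can, so the pattern ends up anchored on the last `<a>` tag in each row. A small sketch on a made-up row (hypothetical HTML) shows the difference between the greedy `.*` and the lazy `.*?`:

```python
import re

# Hypothetical table row with two links
row = '<td><a href="#a">First</a></td><td><a href="#b">Second</a></td>'

# Greedy: .* runs to the end and backtracks, so only the last link survives.
greedy = re.findall(r'<a.*>(.*?)</a>', row)

# Lazy: .*? stops at the first > it can, so each link is found in turn.
lazy = re.findall(r'<a.*?>(.*?)</a>', row)

print(greedy)
print(lazy)
```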


We can also index where we found things...

In [40]:
outTable=[]
matchTable =re.findall(r'<table.*?>(.*?)<\/table>', WebsiteHTML, re.M|re.I|re.S)
for i in range(0,len(matchTable)):
    matchTableRow =re.findall(r'<tr.*?>(.*?)<\/tr>',matchTable[i], re.M|re.I|re.S)
    for j in range(0,len(matchTableRow)):
        matchTableData=re.findall(r'<td.*?>(.*?)<\/td>',matchTableRow[j], re.M|re.I|re.S)
        for k in range(0,len(matchTableData)):
            outTable.append({"Table": i+1, "Row": j+1, "Cell" : k+1 , "Content" : stripText(matchTableData[k])  })
pd.DataFrame(outTable)

| | Table | Row | Cell | Content |
|---|---|---|---|---|
| 0 | 1 | 1 | 1 | Individuals, Firms, and Markets |
| 1 | 1 | 1 | 2 | Quantitative Methods |
| 2 | 1 | 1 | 3 | Communicating Economic Insights |
| 3 | 1 | 2 | 1 | Incentives and Information |
| 4 | 1 | 2 | 2 | Economic Inference from Data |
| 5 | 1 | 3 | 1 | Global Economics and Finance |
| 6 | 1 | 3 | 2 | Applications of Economic Analysis Techniques |
| 7 | 1 | 3 | 3 | Data Design for Economic Applications (Capstone) |
| 8 | 1 | 4 | 1 | Evidence-Based Analysis in Labor, Public and H... |
| 9 | 1 | 4 | 2 | Big Data and Forecasting in Economics |


## More modern methods
Even though knowing regular expressions can be really useful, if we're dealing with HTML, there are other ways to get structured information out of it.

We'll briefly look at two libraries that can do this:
* Beautiful Soup
* lxml.html

Though hopefully Andy taught you more about these!
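
Before turning to those libraries, here is a rough sketch of the idea behind a parser, using only `html.parser` from the standard library. This is not how Beautiful Soup or lxml are implemented, and the `CellCollector` class is a hypothetical helper of my own; it just shows that a parser walks through the tags for us, so we react to `<td>` events instead of writing a pattern for them:

```python
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collect the text inside each <td> cell (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_cell = False   # are we currently inside a <td>?
        self.cells = []        # text of every cell seen so far

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> still arrives here.
        if self.in_cell:
            self.cells[-1] += data

html_table = '<table><tr><td type="normal"><b>Column 1</b></td><td>Entry 2</td></tr></table>'
collector = CellCollector()
collector.feed(html_table)
print(collector.cells)
```

The nested `<b>` tag is dropped without any extra stripping step, which is the kind of convenience the libraries below provide at a much larger scale.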

In [23]:
from bs4 import BeautifulSoup
# Convert our website HTML into a BeautifulSoup object
soup=BeautifulSoup(WebsiteHTML,'html.parser')
type(soup)

bs4.BeautifulSoup

In [42]:
soupImages=soup.find_all("img") # find_all is the current name for the older findAll alias
for image in soupImages:
    print(image["src"])

/sites/all/themes/pitt_theme/img/pitt_logo_2019.png


In [43]:
soupLinks=soup.find_all("a", attrs = {'href' : True}) # every <a> tag that has an href attribute
for link in soupLinks:
    print(link["href"])

#main-content
http://www.pitt.edu
http://www.pitt.edu
http://as.pitt.edu/
https://www.econ.pitt.edu
/
/
https://www.linkedin.com/company/pittms-quantitative-economics
https://www.facebook.com/mqepitt
https://www.instagram.com/mqepitt/
mailto:mqeinfo@pitt.edu
/why-masters-economics
/about/join-our-liveable-campus-city
/about/connect-mqe
/admissions
/admissions/apply-now
/admissions/mqe-mpa-joint-degree
/admissions/faq-admissions
/admissions/tuition-and-fees
/admissions/student-resources
/careers
/careers/career-services
/careers/capstone-projects
/careers/capstone-projects/anthem-medicare-advantage-market-predictor
/careers/capstone-projects/buffalo-sabres-contract-projections
/careers/capstone-projects/cgipnc-consumer-input
/careers/capstone-projects/city-pgh-optimizing-waste-collection
/careers/capstone-projects/fourth-economy-assessing-covid-policies
/careers/capstone-projects/sheetz-spatial-modeling-demand
/economics-data-science-communication-courses
https://www.mqe.pitt.edu/sites/

In [44]:
# Use a regular expression in the find for any href which has an internal link 
soupLinksInternal=soup.find_all("a", {"href": re.compile(r"^\/.*$") }) # raw string: href values that start with a slash
for link in soupLinksInternal:
    print(link["href"])

/
/
/why-masters-economics
/about/join-our-liveable-campus-city
/about/connect-mqe
/admissions
/admissions/apply-now
/admissions/mqe-mpa-joint-degree
/admissions/faq-admissions
/admissions/tuition-and-fees
/admissions/student-resources
/careers
/careers/career-services
/careers/capstone-projects
/careers/capstone-projects/anthem-medicare-advantage-market-predictor
/careers/capstone-projects/buffalo-sabres-contract-projections
/careers/capstone-projects/cgipnc-consumer-input
/careers/capstone-projects/city-pgh-optimizing-waste-collection
/careers/capstone-projects/fourth-economy-assessing-covid-policies
/careers/capstone-projects/sheetz-spatial-modeling-demand
/economics-data-science-communication-courses
/people
/people/faculty
/people/students
/people/advisory-board
/people/mentors
/people/mqe-student-stories
/people/mqe-student-stories/santiago-risco
/people/mqe-student-stories/megan-kveragas
/people/mqe-student-stories/milan-stefanelli-mqe-class-2022
/people/mqe-student-stories/yuxi


In [45]:
# list() collects all of the children within the soup object so we can loop over them
for item in list(soup.children):
    print(type(item) )

<class 'bs4.element.Doctype'>
<class 'bs4.element.NavigableString'>
<class 'bs4.element.Tag'>


In [46]:
soupHTML=list(soup.children)[2]
for item in list(soupHTML.children):
    print(type(item) )
# list(soupHTML.children)[1] this looks like scripts at the top
# list(soupHTML.children)[3] this looks like the main content.
mainBody=list(list(soupHTML.children)[3].children)
for item in list(soupHTML.children):
    print(type(item) )

<class 'bs4.element.NavigableString'>
<class 'bs4.element.Tag'>
<class 'bs4.element.NavigableString'>
<class 'bs4.element.Tag'>
<class 'bs4.element.NavigableString'>
<class 'bs4.element.NavigableString'>
<class 'bs4.element.Tag'>
<class 'bs4.element.NavigableString'>
<class 'bs4.element.Tag'>
<class 'bs4.element.NavigableString'>


So though we can continue to break the HTML code into parseable sections, we can also quickly filter out particular types of tags

In [48]:
for row in soup.select('table td'):
    print(row)

<td width="34%"><a href="#firms-markets">Individuals, Firms, and Markets</a></td>
<td width="27%"><a href="#quant-methods">Quantitative Methods</a></td>
<td rowspan="2" width="27%"><a href="#comm-econ-insights">Communicating Economic Insights</a></td>
<td>Incentives and Information </td>
<td><a href="#econ-interference">Economic Inference from Data</a></td>
<td>Global Economics and Finance</td>
<td><a href="#econ-analysis-tech">Applications of Economic Analysis Techniques</a></td>
<td rowspan="2"><a href="#data-design">Data Design for Economic Applications (Capstone)</a></td>
<td><a href="#evidence-based">Evidence-Based Analysis in Labor, Public and Health Economics</a></td>
<td><a href="#big-data">Big Data and Forecasting in Economics</a></td>
<td><a href="#incentives-info">Incentives and Information</a></td>


In [49]:
for row in soup.select('table td'):
    print(row.get_text())

Individuals, Firms, and Markets
Quantitative Methods
Communicating Economic Insights
Incentives and Information 
Economic Inference from Data
Global Economics and Finance
Applications of Economic Analysis Techniques
Data Design for Economic Applications (Capstone)
Evidence-Based Analysis in Labor, Public and Health Economics
Big Data and Forecasting in Economics
Incentives and Information


In [50]:
for row in soupHTML.select('table th'):
    print(row.get_text()) # Here using the get_text() method to remove the tags

Fall
Session 1
Session 2
Spring
Session 1
Session 2


In [51]:
soup.find_all(id='global-econ')[0] #looking for any tag with a particular id

<a id="global-econ" name="global-econ"></a>

Here we select all of the `<a>` tags within an `<h4>` tag

In [52]:
soup.select("h4 a")

[<a id="firms-markets" name="firms-markets"></a>,
 <a id="quant-methods" name="quant-methods"></a>,
 <a id="incentives-info" name="incentives-info"></a>,
 <a id="comm-econ-insights" name="comm-econ-insights"></a>,
 <a id="econ-interference" name="econ-interference"></a>,
 <a id="global-econ" name="global-econ"></a>,
 <a id="econ-analysis-tech" name="econ-analysis-tech"></a>,
 <a id="evidence-based" name="evidence-based"></a>,
 <a id="big-data" name="big-data"></a>,
 <a id="data-design" name="data-design"></a>]

Another package is `lxml`

In [6]:
%pip install lxml

Collecting lxml
  Downloading lxml-5.3.1-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (3.7 kB)
Downloading lxml-5.3.1-cp312-cp312-manylinux_2_28_x86_64.whl (5.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.0/5.0 MB 23.0 MB/s eta 0:00:00
Installing collected packages: lxml
Successfully installed lxml-5.3.1

[notice] A new release of pip is available: 24.2 -> 25.0.1
[notice] To update, run: python -m pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


In [9]:
import lxml.html as lh
import pandas as pd
doc=lh.fromstring(WebsiteHTML)

In [12]:
listAll=[]
tr_elems=doc.xpath('//tr') # look for tr tags
for i in range(0,len(tr_elems)): # Loop across the table row elements
    dictI={} # create an empty dictionary
    for j  in range( 0 , len( tr_elems[i] ) ): # for each of the elements in the row j =0 to n-1
        dictI[j+1]= tr_elems[i][j].text_content() # Add the entry to key j 
    listAll.append( dictI  )
pd.DataFrame(listAll)

| | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 0 | Fall | Session 1 | Individuals, Firms, and Markets | Quantitative Methods | Communicating Economic Insights |
| 1 | Session 2 | Incentives and Information | Economic Inference from Data | | |
| 2 | Spring | Session 1 | Global Economics and Finance | Applications of Economic Analysis Techniques | Data Design for Economic Applications (Capstone) |
| 3 | Session 2 | Evidence-Based Analysis in Labor, Public and H... | Big Data and Forecasting in Economics | | |
| 4 | Incentives and Information | | | | |


Let's take a look at grabbing some hockey information from the table on the [CapWages](https://www.capwages.com/) front page

In [19]:
response=requests.get("https://www.capwages.com/")

In [20]:
response.status_code

200

In [21]:
capFriendlyHTML=response.text
capFriendlyHTML[0:200]

'<!DOCTYPE html><html lang="en"><head><meta charSet="utf-8"/><meta name="viewport" content="width=device-width, initial-scale=1.0"/><title>CapWages | NHL Salary Data</title><meta name="description" con'

In [24]:
cfDoc=lh.fromstring(capFriendlyHTML) #using lxml.html
cfsoup=BeautifulSoup(capFriendlyHTML,'html.parser') #using Beautiful Soup

Looking at the website, it looks like there's one big table and several small ones; the fourth one has all the cap information.

In [28]:
table_soup=cfsoup.select('table') # collect every <table> tag on the page
table_soup[0]

<table class="table__default table__stickyFirstColumn w-full text-center border-collapse"><thead class="bg-blue-800 text-white dark:text-gray-100 uppercase text-xs sm:text-sm font-sans font-normal cursor-pointer"><tr><th class="text-left px-1">Team</th><th class="px-1 cursor-pointer">Cap Space</th><th class="px-1 cursor-pointer">Cap Hit</th><th class="px-1 cursor-pointer">Roster Size</th><th class="px-1 cursor-pointer">Contracts</th><th class="px-1 cursor-pointer">Retained Slots</th></tr></thead><tbody><tr class="bg-gray-100 dark:bg-gray-800 text-xs"><td class="text-left px-1 py-1 font-normal text-sm md:text-sm"><a class="text-blue-700 dark:text-blue-400 flex items-center min-w-16" data-value="Tampa Bay Lightning" href="/teams/tampa_bay_lightning"><img alt="Tampa Bay Lightning logo" class="w-4 h-4 mr-2" data-nimg="1" decoding="async" height="32" loading="lazy" src="data/tampa_bay_lightning.svg" style="color:transparent" width="32"/><span class="hidden md:block">Tampa Bay Lightning</spa

In [None]:
table_soup=cfsoup.select('table')
for child in list(table_soup[2].children):
    print(child)
    

In [None]:
listAll=[]
table_elems=cfDoc.xpath('//table')
tr_elems=cfDoc.xpath('//tr')
table_elems[1].text_content()

In [None]:
listAll

In [None]:
for i in range(0,len(tr_elems)):
    if len(tr_elems[i])==8: # only keep table rows with eight cells (which should be the team cap hits by year)
        dictI={} # create an empty dictionary
        for j in range(0,len(tr_elems[i])): # for each cell in the row, j = 0 to n-1
            dictI[j+1]=tr_elems[i][j].text_content() # store the cell text under key j+1
        listAll.append(dictI) # add the dictionary to the list
mainTable=pd.DataFrame(listAll[1:]) # drop row zero (the header row) from the data
colNames=[name.lower() for name in listAll[0].values()] # lower-case the header names (avoid naming the variable str, which shadows the built-in)

mainTable.columns=colNames # reassign column names
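
To see the row-length filter in isolation, here is a small sketch that runs the same idea on a made-up HTML string, so it works without a network request. The `sample_html` content, the `rows_to_frame` helper and its `n_cells` parameter are illustrative assumptions, not part of the page above.

```python
import lxml.html as lh
import pandas as pd

# Made-up HTML: a 3-column table we want, plus a 2-column table to ignore
sample_html = """
<html><body>
<table>
  <tr><th>Team</th><th>Cap Hit</th><th>Cap Space</th></tr>
  <tr><td>Team A</td><td>80</td><td>8</td></tr>
  <tr><td>Team B</td><td>85</td><td>3</td></tr>
</table>
<table>
  <tr><th>Note</th><th>Value</th></tr>
  <tr><td>ignored</td><td>0</td></tr>
</table>
</body></html>
"""

def rows_to_frame(doc, n_cells):
    """Collect the text of every <tr> with exactly n_cells cells into a DataFrame."""
    rows = []
    for tr in doc.xpath('//tr'):
        if len(tr) == n_cells:  # rows from other tables have a different width
            rows.append([cell.text_content().strip() for cell in tr])
    header = [name.lower() for name in rows[0]]  # first matching row is the header
    return pd.DataFrame(rows[1:], columns=header)

sample_doc = lh.fromstring(sample_html)
sample_table = rows_to_frame(sample_doc, 3)
print(sample_table)
```

Filtering on the number of cells is a blunt tool: it only works when no other table on the page has rows of the same width.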

Let's grab the team names

In [None]:
teams=mainTable["team"].unique()
teams

Note that we can write for loops inside list or dict definitions in Python (these are called comprehensions), so that:

In [None]:
my_list=[name.lower() for name in listAll[0].values()]
my_list

Is equivalent to doing it one by one:

In [None]:
my_list=[]
for name in listAll[0].values(): # using str as the loop variable here would overwrite the built-in str
    my_list.append(name.lower())
my_list
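
The same pattern works for dictionaries. As a small sketch (the `header_row` values are made up to mimic the shape of `listAll[0]`), a dict comprehension builds the key/value pairs in one expression:

```python
# Made-up stand-in for listAll[0]: column position -> header text
header_row = {1: 'Team', 2: 'Cap Hit', 3: 'Cap Space'}

# Dict comprehension: same keys, lower-cased values
lower_row = {key: name.lower() for key, name in header_row.items()}

# Equivalent explicit loop
lower_row_loop = {}
for key, name in header_row.items():
    lower_row_loop[key] = name.lower()

print(lower_row)
```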

In [None]:
mainTable

## BBC Sports Tables

In [14]:
resp=requests.get('https://www.bbc.com/sport/football/premier-league/table')

In [15]:
doc=lh.fromstring(resp.text)
listAll=[]
tr_elems=doc.xpath('//tr') # look for tr tags
for i in range(0,len(tr_elems)):
    dictI={} # create an empty dictionary
    for j in range(0,len(tr_elems[i])): # for each cell in the row, j = 0 to n-1
        dictI[j+1]=tr_elems[i][j].text_content() # store the cell text under key j+1
    listAll.append(dictI)
premTable=pd.DataFrame(listAll[1:-1]) # skip the header row (first) and the last row
premTable.columns=[name.lower() for name in listAll[0].values()] # the header row supplies the column names
premTable

Unnamed: 0,position,team,played,won,drawn,lost,goals for,goals against,goal difference,points,"form, last 6 games, oldest first"
0,1,Liverpool,29,21,7,1,69,27,42,70,DResult DrawWResult WinDResult DrawWResult Win...
1,2,Arsenal,29,16,10,3,53,24,29,58,WResult WinWResult WinLResult LossDResult Draw...
2,3,Nottingham Forest,29,16,6,7,49,35,14,54,WResult WinLResult LossLResult LossDResult Dra...
3,4,Chelsea,29,14,7,8,53,37,16,49,WResult WinLResult LossLResult LossWResult Win...
4,5,Manchester City,29,14,6,9,55,40,15,48,LResult LossWResult WinLResult LossWResult Win...
5,6,Newcastle United,28,14,5,9,47,38,9,47,WResult WinLResult LossLResult LossWResult Win...
6,7,Brighton & Hove Albion,29,12,11,6,48,42,6,47,LResult LossWResult WinWResult WinWResult WinW...
7,8,Fulham,29,12,9,8,43,38,5,45,WResult WinWResult WinLResult LossWResult WinL...
8,9,Aston Villa,29,12,9,8,41,45,-4,45,LResult LossDResult DrawDResult DrawWResult Wi...
9,10,AFC Bournemouth,29,12,8,9,48,36,12,44,LResult LossWResult WinLResult LossLResult Los...


We can formalize this as a function that fetches the table for any league, given the league's URL stub:

In [16]:
def bbcGetTable(stub):
    # stub is the league part of the BBC URL, e.g. 'premier-league' or 'german-bundesliga'
    resp=requests.get('https://www.bbc.com/sport/football/'+stub+'/table')
    doc=lh.fromstring(resp.text)
    listAll=[]
    tr_elems=doc.xpath('//tr') # look for tr tags
    for i in range(0,len(tr_elems)):
        dictI={} # create an empty dictionary
        for j in range(0,len(tr_elems[i])): # for each cell in the row, j = 0 to n-1
            dictI[j+1]=tr_elems[i][j].text_content() # store the cell text under key j+1
        listAll.append(dictI)
    outTable=pd.DataFrame(listAll[1:-1]) # skip the header row (first) and the last row
    outTable.columns=[name.lower() for name in listAll[0].values()] # the header row supplies the column names
    return outTable

In [17]:
bbcGetTable('german-bundesliga')

Unnamed: 0,position,team,played,won,drawn,lost,goals for,goals against,goal difference,points,"form, last 6 games, oldest first"
0,1,Bayern Munich,26,19,5,2,75,24,51,62,WResult WinDResult DrawWResult WinWResult WinL...
1,2,Bayer Leverkusen,26,16,8,2,59,33,26,56,DResult DrawDResult DrawWResult WinWResult Win...
2,3,Mainz 05,26,13,6,7,44,28,16,45,DResult DrawWResult WinWResult WinWResult WinW...
3,4,Eintracht Frankfurt,26,13,6,7,54,40,14,45,DResult DrawWResult WinLResult LossLResult Los...
4,5,RB Leipzig,26,11,9,6,41,33,8,42,WResult WinDResult DrawDResult DrawLResult Los...
5,6,Freiburg,26,12,6,8,36,38,-2,42,WResult WinWResult WinWResult WinDResult DrawD...
6,7,Borussia M'gladbach,26,12,4,10,43,40,3,40,DResult DrawWResult WinLResult LossWResult Win...
7,8,Wolfsburg,26,10,8,8,49,40,9,38,DResult DrawWResult WinDResult DrawWResult Win...
8,9,Augsburg,26,10,8,8,29,35,-6,38,DResult DrawDResult DrawWResult WinDResult Dra...
9,10,Stuttgart,26,10,7,9,47,43,4,37,WResult WinLResult LossDResult DrawLResult Los...
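
The form column in these outputs runs each result letter together with its label (for example `WResult Win`). One way to tidy it, sketched here on a made-up string of the same shape, is to keep only the single-letter codes with a regular expression. The `form_letters` helper name and the `sample_form` value are illustrative assumptions.

```python
import re

def form_letters(raw):
    """Return the W/D/L codes from text like 'WResult WinDResult Draw'."""
    # Capture the leading letter; the label after it is matched but not kept
    return re.findall(r'([WDL])Result (?:Win|Draw|Loss)', raw)

sample_form = 'WResult WinLResult LossDResult Draw'
print(form_letters(sample_form))  # ['W', 'L', 'D']
```

On a table returned by `bbcGetTable`, something like `outTable.iloc[:, -1].map(form_letters)` should apply this to the last column, assuming the form column stays in that position.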
