# Data Science - Web Scraping

## Tasks Today:

1) <b>Requests</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) Importing <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) Using Requests <br>
2) <b>Beautiful Soup</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) Importing <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) Using Beautiful Soup <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) .prettify() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) Converting to a List <br>
 &nbsp;&nbsp;&nbsp;&nbsp; e) Extracting Beautiful Soup Elements <br>
 &nbsp;&nbsp;&nbsp;&nbsp; f) Assigning Variables from Beautiful Soup <br>
 &nbsp;&nbsp;&nbsp;&nbsp; g) .find() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; h) .find_all() <br>
3) <b>Exercise</b> <br>

## Requests

In [1]:
# Install Beautiful Soup and Requests
!pip install beautifulsoup4
!pip install requests

Collecting beautifulsoup4
  Downloading beautifulsoup4-4.9.3-py3-none-any.whl (115 kB)
Collecting soupsieve>1.2
  Downloading soupsieve-2.2-py3-none-any.whl (33 kB)
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.9.3 soupsieve-2.2


In [None]:
# https://www.arthurleej.com/e-love.html
# https://requests.readthedocs.io/en/master/

### Importing

In [2]:
import requests

### Using Requests

In [3]:
# Connect to URL
page = requests.get('http://www.arthurleej.com/e-love.html')

In [4]:
# display result response
page

<Response [200]>
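
A `<Response [200]>` means the server answered successfully. As a minimal sketch of checking for failure before parsing, the helper below wraps `requests.get`. The name `fetch_page` and its `timeout` default are hypothetical choices for illustration, not part of the original notebook or of the requests library.

```python
import requests


def fetch_page(url, timeout=10):
    """Return the page body as bytes, or None if the request fails.

    `fetch_page` is a hypothetical helper, not part of the requests library.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # raise_for_status() raises an HTTPError for 4xx/5xx responses
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        print(f'Request failed: {error}')
        return None
    return response.content


# A URL without a scheme is rejected while the request is being prepared
print(fetch_page('not-a-url'))
```

With a real address, the returned bytes could be passed to `BeautifulSoup` in place of `page.content`.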

##### .content

In [5]:
# Display the raw response body as bytes
page.content

b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">\r<html>\r<head>\r\t<title>Essay on Love by Arthur Lee Jacobson</title>\r<meta name="description" content="Trees,gardening, wild and domestic plant life are the specialty of author Arthur Lee Jacobson.">\r<meta name="keywords" content="trees, gardening, wild plants, domestic plants, gardening author, gardening books, Arthur Lee Jacobson, A L J, A L Jacobson, Arthur Jacobson, arthur lee, plants, flowers, seattle, washington">\r<meta name="resource-type" content="document">\r<meta name="generator" content="BBEdit 4.5">\r<meta name="robots" content="all">\r<meta name="classification" content="Gardening">\r<meta name="distribution" content="global">\r<meta name="rating" content="general">\r<meta name="copyright" content="2001 Arthur Lee Jacobson">\r<meta name="author" content="eriktyme@eriktyme.com">\r<meta name="language" content="en-us">\r</head>\r<body background="images/background1a.jpg" bgcolor="#FFFFCC" text="#000000" link="#00

## Beautiful Soup

### Importing

In [6]:
from bs4 import BeautifulSoup

### Using Beautiful Soup

In [7]:
# Instantiate BeautifulSoup class
soup = BeautifulSoup(page.content, 'html.parser')
soup

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
 <html> <head> <title>Essay on Love by Arthur Lee Jacobson</title> <meta content="Trees,gardening, wild and domestic plant life are the specialty of author Arthur Lee Jacobson." name="description"/> <meta content="trees, gardening, wild plants, domestic plants, gardening author, gardening books, Arthur Lee Jacobson, A L J, A L Jacobson, Arthur Jacobson, arthur lee, plants, flowers, seattle, washington" name="keywords"/> <meta content="document" name="resource-type"/> <meta content="BBEdit 4.5" name="generator"/> <meta content="all" name="robots"/> <meta content="Gardening" name="classification"/> <meta content="global" name="distribution"/> <meta content="general" name="rating"/> <meta content="2001 Arthur Lee Jacobson" name="copyright"/> <meta content="eriktyme@eriktyme.com" name="author"/> <meta content="en-us" name="language"/> </head> <body alink="#33CC33" background="images/background1a.jpg" bgcolor="#FFFFCC" link="#0000FF" t

### .prettify()

In [8]:
# NOTE: .prettify() works on a BeautifulSoup object or a single Tag (such as the result of .find()), not on the list returned by .find_all()
print(soup.prettify())

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
 <head>
  <title>
   Essay on Love by Arthur Lee Jacobson
  </title>
  <meta content="Trees,gardening, wild and domestic plant life are the specialty of author Arthur Lee Jacobson." name="description"/>
  <meta content="trees, gardening, wild plants, domestic plants, gardening author, gardening books, Arthur Lee Jacobson, A L J, A L Jacobson, Arthur Jacobson, arthur lee, plants, flowers, seattle, washington" name="keywords"/>
  <meta content="document" name="resource-type"/>
  <meta content="BBEdit 4.5" name="generator"/>
  <meta content="all" name="robots"/>
  <meta content="Gardening" name="classification"/>
  <meta content="global" name="distribution"/>
  <meta content="general" name="rating"/>
  <meta content="2001 Arthur Lee Jacobson" name="copyright"/>
  <meta content="eriktyme@eriktyme.com" name="author"/>
  <meta content="en-us" name="language"/>
 </head>
 <body alink="#33CC33" background="images/background1a.jpg" b
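
`.prettify()` is defined on the `BeautifulSoup` object and on individual `Tag` objects, while the list-like result of `.find_all()` does not have it and raises an `AttributeError`. The sketch below illustrates the difference on a small made-up document (`sample_html` is an assumption used only for this example).

```python
from bs4 import BeautifulSoup

# Small inline document used only for illustration
sample_html = '<html><body><p>One</p><p>Two</p></body></html>'
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# .find() returns a single Tag, which has .prettify()
first_p = sample_soup.find('p')
print(first_p.prettify())

# .find_all() returns a list-like ResultSet, so prettify each Tag in turn
pretty_paragraphs = [tag.prettify() for tag in sample_soup.find_all('p')]
print(len(pretty_paragraphs))
```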

### Converting to a List

In [10]:
# Tags may contain strings and other tags. These elements are the tag’s children.
list(soup.children)
print(f'\n Count of children: {len(list(soup.children))}')


 Count of children: 4


### Extracting Beautiful Soup Elements

In [11]:
# We can traverse an HTML page and extract other tags and text
# The example below shows the types of the objects found among the children of the parsed HTML document
# A Tag lets us go deeper into the document, e.g. we can read HTML attributes such as class and navigate to nested tags from there
[type(item) for item in list(soup.children)]

[bs4.element.Doctype,
 bs4.element.NavigableString,
 bs4.element.Tag,
 bs4.element.NavigableString]
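
To make the distinction between these types concrete, here is a minimal sketch on a made-up snippet (`sample_html` is an assumption, not the essay page). It separates text nodes from tags and reads attributes from a `Tag`.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

sample_html = '<div class="essay" id="intro">Hello <b>world</b>!</div>'
sample_soup = BeautifulSoup(sample_html, 'html.parser')
div = sample_soup.find('div')

# Children are a mix of text nodes (NavigableString) and nested tags (Tag)
for child in div.children:
    kind = 'tag' if isinstance(child, Tag) else 'text'
    print(kind, repr(str(child)))

# Only Tag objects have a name and attributes; class is multi-valued, so it is a list
tag_children = [child for child in div.children if isinstance(child, Tag)]
print(tag_children[0].name)   # b
print(div['class'])           # ['essay']
print(div.get('id'))          # intro
```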

### Assigning Variables from Beautiful Soup

In [15]:
html = list(soup.children)[2]     # select the <html> tag from the soup's children
body = list(html.children)[3]     # select the <body> tag from the <html> children
center = list(body.children)[4]   # select the fifth child of <body>
table = list(center.children)[0]  # select its first child, the <table> printed below

print(table.prettify())

<table border="0" cellpadding="1" cellspacing="2">
 <tr>
  <td align="center" valign="top" width="480">
   <table border="0" cellpadding="1" cellspacing="2">
    <tr>
     <td align="center" valign="top" width="480">
      <font size="5">
       <b>
        Love
       </b>
      </font>
     </td>
    </tr>
    <tr>
     <td align="left" valign="top" width="480">
      <font size="3">
       <b>
        Of the fourteen essays I'm writing, only this one treats an emotion. That love is the most important emotion is the deduction. I think other emotions may be as important, but are not so powerfully moving or interesting to most of us. Love is exciting. There is no need to justify choosing to write about it. Are not most songs love songs? Are not most novels stories featuring love?
       </b>
      </font>
     </td>
    </tr>
    <tr>
     <td align="left" valign="top" width="480">
      <font size="3">
       <b>
        Love in its broad sense is the feeling of strong attraction, and
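
Indexing into `list(....children)` depends on the exact whitespace and tag order of the page, so the index numbers can change when the markup shifts. One possible alternative, sketched on made-up markup that only loosely mirrors the essay page, is to navigate by tag name.

```python
from bs4 import BeautifulSoup

# Made-up markup that loosely mirrors the nesting of the essay page
sample_html = """
<html>
 <head><title>Sample</title></head>
 <body>
  <center>
   <table><tr><td><b>Love</b></td></tr></table>
  </center>
 </body>
</html>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Attribute access returns the first descendant tag with that name
table_by_name = sample_soup.body.center.table

# .find() runs the same kind of search without spelling out every level
table_by_find = sample_soup.find('table')

print(table_by_name is table_by_find)
print(table_by_find.find('b').get_text())
```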

### .find() <br>
<p>Returns the first tag that matches the argument passed in, or None if nothing matches</p>

In [16]:
soup.find('b')

<b>Love</b>
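
A short sketch of how `.find()` behaves with attribute filters and with no match, using a made-up snippet (`sample_html` is an assumption for illustration):

```python
from bs4 import BeautifulSoup

sample_html = '<p><b>Love</b> <font size="3"><b>Second</b></font></p>'
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Only the first match is returned
print(sample_soup.find('b'))                 # <b>Love</b>

# Attribute filters narrow the search
font = sample_soup.find('font', {'size': '3'})
print(font.find('b').get_text())             # Second

# No match returns None, so check before using the result
missing = sample_soup.find('table')
if missing is None:
    print('No <table> found')
```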

### .find_all() <br>
<p>Similar to .find(), except it returns a list of every matching tag instead of only the first one</p>

In [22]:
text_corpus = soup.find_all('b')

raw_text = []
for i in text_corpus:
    raw_text.append(i.get_text())
    
raw_text

['Love',
 "\xa0\xa0\xa0\xa0Of the fourteen essays I'm writing, only this one treats an emotion. That love is the most important emotion is the deduction. I think other emotions may be as important, but are not so powerfully moving or interesting to most of us. Love is exciting. There is no need to justify choosing to write about it. Are not most songs love songs? Are not most novels stories featuring love?",
 '\xa0\xa0\xa0\xa0Love in its broad sense is the feeling of strong attraction, and often attachment and protection. It is felt towards other people, towards pets, towards inanimate objects, towards abstractions such as patriotism, religious matters, hobbies, and I suppose nearly everything. It is multifaceted, and includes ordinary self-love, chivalrous love, carnal or sexual love, friendly love, family love. It is an emotion that is closely related to certain others, such as hope. At its simplest level it is what we strongly like.',
 "\xa0\xa0\xa0\xa0I have a hunch that love, like
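
The `\xa0` characters in the output are non-breaking spaces (`&nbsp;` in the HTML). One way to drop them, sketched here on a made-up snippet rather than the real page, is to pass `strip=True` to `.get_text()`.

```python
from bs4 import BeautifulSoup

# Made-up snippet with the same leading non-breaking spaces (&nbsp;) as the essay
sample_html = '<b>Love</b><b>&nbsp;&nbsp;&nbsp;&nbsp;Of the fourteen essays</b>'
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# strip=True trims whitespace, including non-breaking spaces, from each text node
clean_text = [tag.get_text(strip=True) for tag in sample_soup.find_all('b')]
print(clean_text)   # ['Love', 'Of the fourteen essays']
```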

## Exercise <br>
<p>Using the Beautiful Soup library, grab the data from the following link: https://www.nbastuffer.com/2019-2020-nba-player-stats/. After getting the data, display each player's name and team in a pandas DataFrame.</p>

In [49]:
# Hint: Use the .get_text() method

import pandas as pd

# page = requests.get('https://www.nbastuffer.com/2019-2020-nba-player-stats/')
# soup = BeautifulSoup(page.content, 'html.parser')
# trall = soup.find_all('tr')
# for tr in trall:
#     tdall = list(tr.find_all('td'))
#     print(tdall)

page = requests.get('https://www.nbastuffer.com/2019-2020-nba-player-stats/')
soup = BeautifulSoup(page.content, 'html.parser')

names = []
teams = []

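# Each table row (<tr>) holds one player; text nodes 1 and 2 are the name and team
for node in soup.find_all('tr'):
    texts = node.find_all(string=True)
    names.append(texts[1])
    teams.append(texts[2])

# Drop the header row
names.pop(0)
teams.pop(0)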

player_data = pd.DataFrame.from_dict({
    'player_name': names,
    'team': teams
})

player_data

| | player_name | team |
|---|---|---|
| 0 | Jaylen Adams | Por |
| 1 | Steven Adams | Okc |
| 2 | Bam Adebayo | Mia |
| 3 | Jarrett Allen | Bro |
| 4 | Justin Anderson | Bro |
| ... | ... | ... |
| 803 | Trae Young | Atl |
| 804 | Cody Zeller | Cha |
| 805 | Tyler Zeller | San |
| 806 | Ante Zizic | Cle |
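
Picking text nodes by position works here, but it is sensitive to stray whitespace in the markup. A possible variation, sketched on a made-up table, is to read the `<td>` cells of each row instead. The header names and column order below are assumptions and may not match the real page.

```python
from bs4 import BeautifulSoup

# Made-up table; the column order of the real page is an assumption
sample_html = """
<table>
 <tr><th>RANK</th><th>FULL NAME</th><th>TEAM</th></tr>
 <tr><td>1</td><td>Jaylen Adams</td><td>Por</td></tr>
 <tr><td>2</td><td>Steven Adams</td><td>Okc</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

rows = []
for tr in sample_soup.find_all('tr'):
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    # The header row has only <th> cells, so it yields no <td> cells and is skipped
    if len(cells) >= 3:
        rows.append({'player_name': cells[1], 'team': cells[2]})

print(rows)
```

Passing `rows` to `pd.DataFrame(rows)` would give a DataFrame with `player_name` and `team` columns.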


# Bonus Example: Pulling Vegas Odds from PFR.com

<h3> Use this example for further reference</h3>
<p>This example shows what Beautiful Soup returns when we access an HTML document</p>

In [50]:
page = requests.get('https://www.pro-football-reference.com/boxscores/201810140nwe.htm')
# print(page.status_code)

soup = BeautifulSoup(page.content, 'html.parser')

In [51]:
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js" data-root="/home/pfr/build" data-version="klecko-" itemscope="" itemtype="https://schema.org/WebSite" lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="ie=edge" http-equiv="x-ua-compatible"/>
  <meta content="width=device-width, initial-scale=1.0, maximum-scale=2.0" name="viewport">
   <link href="https://d2p3bygnnzw9w3.cloudfront.net/req/202102241" rel="dns-prefetch"/>
   <!-- Quantcast Choice. Consent Manager Tag v2.0 (for TCF 2.0) -->
   <script async="true" type="text/javascript">
    (function() {
	var host = window.location.hostname;
	var element = document.createElement('script');
	var firstScript = document.getElementsByTagName('script')[0];
	var url = 'https://quantcast.mgr.consensu.org'
	    .concat('/choice/', 'XwNYEpNeFfhfr', '/', host, '/choice.js')
	var uspTries = 0;
	var uspTriesLimit = 3;
	element.async = true;
	element.type = 'text/javascript';
	element.src = url;
	
	firstScript.parentNode.insertBefore(element, firstScrip

In [52]:
for section in list(soup.children):
    print(section)
    print('1\n2\n3\n')



1
2
3

html
1
2
3



1
2
3

<html class="no-js" data-root="/home/pfr/build" data-version="klecko-" itemscope="" itemtype="https://schema.org/WebSite" lang="en">
<head>
<meta charset="utf-8"/>
<meta content="ie=edge" http-equiv="x-ua-compatible"/>
<meta content="width=device-width, initial-scale=1.0, maximum-scale=2.0" name="viewport">
<link href="https://d2p3bygnnzw9w3.cloudfront.net/req/202102241" rel="dns-prefetch"/>
<!-- Quantcast Choice. Consent Manager Tag v2.0 (for TCF 2.0) -->
<script async="true" type="text/javascript">
    (function() {
	var host = window.location.hostname;
	var element = document.createElement('script');
	var firstScript = document.getElementsByTagName('script')[0];
	var url = 'https://quantcast.mgr.consensu.org'
	    .concat('/choice/', 'XwNYEpNeFfhfr', '/', host, '/choice.js')
	var uspTries = 0;
	var uspTriesLimit = 3;
	element.async = true;
	element.type = 'text/javascript';
	element.src = url;
	
	firstScript.parentNode.insertBefore(element, firstScript)

In [53]:
html = list(soup.children)[3]

html

<html class="no-js" data-root="/home/pfr/build" data-version="klecko-" itemscope="" itemtype="https://schema.org/WebSite" lang="en">
<head>
<meta charset="utf-8"/>
<meta content="ie=edge" http-equiv="x-ua-compatible"/>
<meta content="width=device-width, initial-scale=1.0, maximum-scale=2.0" name="viewport">
<link href="https://d2p3bygnnzw9w3.cloudfront.net/req/202102241" rel="dns-prefetch"/>
<!-- Quantcast Choice. Consent Manager Tag v2.0 (for TCF 2.0) -->
<script async="true" type="text/javascript">
    (function() {
	var host = window.location.hostname;
	var element = document.createElement('script');
	var firstScript = document.getElementsByTagName('script')[0];
	var url = 'https://quantcast.mgr.consensu.org'
	    .concat('/choice/', 'XwNYEpNeFfhfr', '/', host, '/choice.js')
	var uspTries = 0;
	var uspTriesLimit = 3;
	element.async = true;
	element.type = 'text/javascript';
	element.src = url;
	
	firstScript.parentNode.insertBefore(element, firstScript);
	
	function makeStub() {
	  

In [54]:
body = list(html.children)[3]

for el in list(body.children):
    print(el)
    print('\n\n\n\n123\n\n\n\n')







123




<div id="wrap">
<div id="header" role="banner">
<ul class="notranslate" id="subnav">
<li><a href="https://www.sports-reference.com/"><svg height="15px" width="20px"><use xlink:href="#ic-sr-pennant"></use></svg> Sports Reference</a></li>
<li><a href="https://www.baseball-reference.com/">Baseball</a></li>
<li class="current"><a href="https://www.pro-football-reference.com/">Football</a> <a href="https://www.sports-reference.com/cfb/">(college)</a></li>
<li><a href="https://www.basketball-reference.com/">Basketball</a> <a href="https://www.sports-reference.com/cbb/">(college)</a></li>
<li><a href="https://www.hockey-reference.com/">Hockey</a></li>
<li><a href="https://fbref.com/pt/">Futebol</a></li>
<li><a href="https://www.sports-reference.com/blog/">Blog</a></li>
<li><a href="https://stathead.com/?utm_source=web&amp;utm_medium=pfr&amp;utm_campaign=sr-nav-bar-top-link">Stathead</a></li>
<li><a href="https://widgets.sports-reference.com/">Widgets</a></li>
<li><a href="#" onc

In [55]:
table = body.find_all('div')

print(table)

[<div id="wrap">
<div id="header" role="banner">
<ul class="notranslate" id="subnav">
<li><a href="https://www.sports-reference.com/"><svg height="15px" width="20px"><use xlink:href="#ic-sr-pennant"></use></svg> Sports Reference</a></li>
<li><a href="https://www.baseball-reference.com/">Baseball</a></li>
<li class="current"><a href="https://www.pro-football-reference.com/">Football</a> <a href="https://www.sports-reference.com/cfb/">(college)</a></li>
<li><a href="https://www.basketball-reference.com/">Basketball</a> <a href="https://www.sports-reference.com/cbb/">(college)</a></li>
<li><a href="https://www.hockey-reference.com/">Hockey</a></li>
<li><a href="https://fbref.com/pt/">Futebol</a></li>
<li><a href="https://www.sports-reference.com/blog/">Blog</a></li>
<li><a href="https://stathead.com/?utm_source=web&amp;utm_medium=pfr&amp;utm_campaign=sr-nav-bar-top-link">Stathead</a></li>
<li><a href="https://widgets.sports-reference.com/">Widgets</a></li>
<li><a href="#" onclick="Freshwo

In [56]:
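from bs4 import Comment

# Collect every HTML comment node in the document
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

for comment in comments:
    # Re-parse the comment's contents as HTML so it can be searched like ordinary tags
    comment_soup = BeautifulSoup(str(comment), 'html.parser')
    log = comment_soup.find('table', {'id': 'game_info'})
    if log:
        print(log)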

<table class="suppress_all sortable stats_table" data-cols-to-freeze="0" id="game_info">
<caption>Game Info Table</caption>
<tr class="thead onecell"><td class="right center" colspan="2" data-stat="onecell">Game Info</td></tr>
<tr><th class="center" data-stat="info" scope="row">Won Toss</th><td class="center" data-stat="stat">Chiefs (deferred)</td></tr>
<tr><th class="center" data-stat="info" scope="row">Roof</th><td class="center" data-stat="stat">outdoors</td></tr>
<tr><th class="center" data-stat="info" scope="row">Surface</th><td class="center" data-stat="stat">fieldturf </td></tr>
<tr><th class="center" data-stat="info" scope="row">Duration</th><td class="center" data-stat="stat">3:07</td></tr>
<tr><th class="center" data-stat="info" scope="row">Attendance</th><td class="center" data-stat="stat"><a href="/years/2018/attendance.htm">65,878</a></td></tr>
<tr><th class="center" data-stat="info" scope="row">Weather</th><td class="center" data-stat="stat">45 degrees, no wind</td></tr>
