# Web Scraping in Python
## Python 3.6 ([Python 2 EOL in ~12 months](https://pythonclock.org/))
## PyHou Meetup Dec18
---
##### [Ryan Thielke ](https://www.linkedin.com/in/ryan-thielke-b987012a/) 
##### [Two Sigma ](https://www.twosigma.com) 

# Agenda
---
### 1) Requests (One of the most important libs to know!)

### 2) BeautifulSoup

### 3) Q&A - But feel free to stop me at any time

# Motivation
---
* Every year I tell myself I'm going to use data to build the best fantasy football team ever! (never pans out)
* Learn how to interact with web services via HTTP 
* Learn how to retrieve and store data for later analysis

### But before you go build a custom scraper...
---
* See if the site offers an API
* [Respect robots.txt](http://www.robotstxt.org/)
* Be mindful of JavaScript frameworks (e.g. React) that render content in the browser; you might need Selenium or headless Chrome
* Expect a high maintenance burden: scrapers break whenever site owners change their pages
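
The robots.txt check can be automated with the standard library. This is a minimal sketch: the robots.txt content, the `example.com` URLs, and the `my-scraper` user-agent name are all made up for illustration. A real scraper would call `set_url()` and `read()` against the site's own file instead of `parse()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at <site>/robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether our (made-up) user agent may fetch each URL
allowed_public = parser.can_fetch("my-scraper", "http://example.com/rankings")
allowed_private = parser.can_fetch("my-scraper", "http://example.com/private/data")

print("public page allowed: ", allowed_public)
print("private page allowed:", allowed_private)
```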

### HTTP Verbs
---
* **GET**     - Returns a resource from a given URL
* **HEAD**    - Like GET, but only return the headers
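* **PUT**     - Creates or replaces the resource at a given URL (repeating it gives the same result)
* **POST**    - Submits data to a resource for processing, often creating a new resource; unlike PUT, repeating it may create duplicates
* **DELETE**  - Removes the resource at a given URL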
* **OPTIONS** - See available HTTP methods at an endpoint

[And a few others - see the Wikipedia page for more info](https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol#Request_methods)
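
The difference between PUT and POST is easier to see with a toy model. The `ToyStore` class below is purely hypothetical: it is an in-memory stand-in for a server, not part of any library. It only illustrates that PUT targets a URL the client chooses, while POST lets the server choose one.

```python
import itertools


class ToyStore:
    """Hypothetical in-memory stand-in for a server's resources."""

    def __init__(self):
        self.resources = {}
        self._ids = itertools.count(1)

    def put(self, path, body):
        # PUT: the client names the URL; create or overwrite what is there
        self.resources[path] = body
        return path

    def post(self, collection, body):
        # POST: the server picks a new URL under the collection
        path = "{}/{}".format(collection, next(self._ids))
        self.resources[path] = body
        return path

    def delete(self, path):
        # DELETE: remove the resource if it exists
        self.resources.pop(path, None)


store = ToyStore()
store.put("/players/42", {"name": "A"})
store.put("/players/42", {"name": "A"})  # repeating a PUT changes nothing
store.post("/players", {"name": "B"})
store.post("/players", {"name": "B"})    # repeating a POST adds a duplicate

print(sorted(store.resources))
```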

# Requests: HTTP For Humans
---
[Great Documentation](https://docs.python-requests.org/en/master)

#### ALL the Features:
* Keep-Alive & Connection Pooling
* International Domains and URLs
* Sessions with Cookie Persistence
* Browser-style SSL Verification
* Automatic Content Decoding
* Basic/Digest Authentication
* Elegant Key/Value Cookies
* Automatic Decompression
* Unicode Response Bodies
* HTTP(S) Proxy Support
* Multipart File Uploads
* Streaming Downloads
* Connection Timeouts
* Chunked Requests
* .netrc Support


### [Start Flask Server](http://localhost:3000)

In [1]:
# A very simple GET 
import requests

BASE_URL = "http://localhost:3000"

response = requests.get(BASE_URL)
print(response.text)


    <html>
        <body>
            <h1>Python User Group Meetup December 2018</h1>
            <ul>
                <li>
                    <a href="/rest">Sample JSON Data</a>
                </li>
                <li>
                    <a href="/error">Sample 500 Error</a>
                </li>
                <li>
                    <a href="/beautiful_soup">Sample HTML</a>
                </li>
            </ul>
        </body>
    </html>
    


In [2]:
# Three ways to consume content from the resource
from urllib.parse import urljoin

BASE_URL = "http://localhost:3000"
JSON_ENDPOINT = "rest" # Json data

response = requests.get(urljoin(BASE_URL, JSON_ENDPOINT))

print(response.content)
print(type(response.content), end='\n\n')
print(response.text)
print(type(response.text), end='\n\n')
print(response.json()) # json.loads(response.text)
print(type(response.json()), end='\n\n')

b'{"first_name": "Ryan", "last_name": "Thielke", "location": "Python User Group Meetup", "employer": "Two Sigma"}'
<class 'bytes'>

{"first_name": "Ryan", "last_name": "Thielke", "location": "Python User Group Meetup", "employer": "Two Sigma"}
<class 'str'>

{'first_name': 'Ryan', 'last_name': 'Thielke', 'location': 'Python User Group Meetup', 'employer': 'Two Sigma'}
<class 'dict'>
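
One thing to watch with `urljoin`: the result depends on trailing and leading slashes. A short sketch follows; the `/api` path is an assumption for illustration, since the demo server above has no such prefix.

```python
from urllib.parse import urljoin

# No path on the base: the endpoint is simply appended
plain = urljoin("http://localhost:3000", "rest")

# Base path WITHOUT a trailing slash: the last segment is replaced
replaced = urljoin("http://localhost:3000/api", "rest")

# Base path WITH a trailing slash: the endpoint is appended
appended = urljoin("http://localhost:3000/api/", "rest")

# A leading slash on the endpoint always resets to the site root
rooted = urljoin("http://localhost:3000/api/", "/rest")

for url in (plain, replaced, appended, rooted):
    print(url)
```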



In [3]:
# Always good to check your status codes
# Fail early and often 
BASE_URL = "http://localhost:3000"
ERROR_ENDPOINT = "error"
response = requests.get(urljoin(BASE_URL, ERROR_ENDPOINT))

print(response.headers)
print(response.status_code)
response.raise_for_status()

{'Content-Type': 'text/html', 'Content-Length': '291', 'Server': 'Werkzeug/0.14.1 Python/3.6.1', 'Date': 'Wed, 19 Dec 2018 01:30:17 GMT'}
500


HTTPError: 500 Server Error: INTERNAL SERVER ERROR for url: http://localhost:3000/error
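
In a real scraper you would usually catch that exception rather than let it crash the run. Here is a minimal sketch. To keep it runnable without a server, it uses hand-built `requests.Response` objects as stand-ins for real replies, and the helper name `response_or_none` is made up.

```python
import requests


def response_or_none(response):
    """Return the response if it succeeded; otherwise report and return None."""
    try:
        response.raise_for_status()
    except requests.exceptions.HTTPError as exc:
        print("Request failed:", exc)
        return None
    return response


# Stand-ins for what requests.get() would return (assumption for the demo)
fake_error = requests.Response()
fake_error.status_code = 500
fake_error.reason = "INTERNAL SERVER ERROR"
fake_error.url = "http://localhost:3000/error"

fake_ok = requests.Response()
fake_ok.status_code = 200

failed = response_or_none(fake_error)
succeeded = response_or_none(fake_ok)
```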

In [4]:
# I don't remember all the http codes
# Luckily requests makes it easy
dir(requests.codes)

['-O-',
 '-o-',
 '/o\\',
 'ACCEPTED',
 'ALL_GOOD',
 'ALL_OK',
 'ALL_OKAY',
 'ALREADY_REPORTED',
 'BAD',
 'BAD_GATEWAY',
 'BAD_REQUEST',
 'BANDWIDTH',
 'BANDWIDTH_LIMIT_EXCEEDED',
 'BLOCKED_BY_WINDOWS_PARENTAL_CONTROLS',
 'CHECKPOINT',
 'CLIENT_CLOSED_REQUEST',
 'CONFLICT',
 'CONTINUE',
 'CREATED',
 'DEPENDENCY',
 'EXPECTATION_FAILED',
 'FAILED_DEPENDENCY',
 'FIELDS_TOO_LARGE',
 'FORBIDDEN',
 'FOUND',
 'GATEWAY_TIMEOUT',
 'GONE',
 'HEADER_FIELDS_TOO_LARGE',
 'HTTP_VERSION',
 'HTTP_VERSION_NOT_SUPPORTED',
 'IM_A_TEAPOT',
 'IM_USED',
 'INSUFFICIENT_STORAGE',
 'INTERNAL_SERVER_ERROR',
 'I_AM_A_TEAPOT',
 'LEGAL_REASONS',
 'LENGTH_REQUIRED',
 'LOCKED',
 'MEDIA_TYPE',
 'METHOD_NOT_ALLOWED',
 'MISDIRECTED_REQUEST',
 'MOVED',
 'MOVED_PERMANENTLY',
 'MULTIPLE_CHOICES',
 'MULTIPLE_STATI',
 'MULTIPLE_STATUS',
 'MULTI_STATI',
 'MULTI_STATUS',
 'NETWORK_AUTH',
 'NETWORK_AUTHENTICATION',
 'NETWORK_AUTHENTICATION_REQUIRED',
 'NONE',
 'NON_AUTHORITATIVE_INFO',
 'NON_AUTHORITATIVE_INFORMATION',
 'NOT_AC
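
Those names are also available in lowercase, and they make status checks more readable than bare numbers. A small sketch; the `describe` helper and its categories are invented for illustration.

```python
import requests


def describe(status_code):
    """Map a status code to a rough category (hypothetical helper)."""
    if status_code == requests.codes.ok:
        return "success"
    if status_code == requests.codes.not_found:
        return "missing page"
    if status_code >= requests.codes.internal_server_error:
        return "server error"
    return "other"


for code in (200, 404, 500, 302):
    print(code, describe(code))
```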

### https://httpbin.org
Also developed by Kenneth Reitz, the author of Requests

In [6]:
# Sessions will persist parameters (cookies, auth, proxies) across multiple calls
import pprint

BASE_URL = "http://httpbin.org/"
query_params = {'name': 'Ryan Thielke', 
                'passphrase': '%this$ @is# a secret phrase'}

with requests.Session() as session:
    session.auth = ('my_username', 'my_password')
    
    response1 = session.get(urljoin(BASE_URL, 'get'),
                            params=query_params)
    response2 = session.get(urljoin(BASE_URL, 'get'))
    
    pprint.pprint(response2.json())

{'args': {},
 'headers': {'Accept': '*/*',
             'Accept-Encoding': 'gzip, deflate',
             'Authorization': 'Basic bXlfdXNlcm5hbWU6bXlfcGFzc3dvcmQ=',
             'Connection': 'close',
             'Host': 'httpbin.org',
             'User-Agent': 'python-requests/2.19.1'},
 'origin': '98.196.168.3',
 'url': 'http://httpbin.org/get'}
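
Note that `args` is empty in the output: `session.auth` persists across calls, but `params` passed to a single call do not. You can inspect this without sending anything by preparing the requests. The sketch below uses `Session.prepare_request()`, which merges session settings into a request offline.

```python
import base64

import requests

query_params = {"name": "Ryan Thielke",
                "passphrase": "%this$ @is# a secret phrase"}

with requests.Session() as session:
    session.auth = ("my_username", "my_password")
    # Build the requests but do not send them
    with_params = session.prepare_request(
        requests.Request("GET", "http://httpbin.org/get", params=query_params))
    without_params = session.prepare_request(
        requests.Request("GET", "http://httpbin.org/get"))

# Query parameters are URL-encoded for you, and only on the call that passed them
print(with_params.url)
print(without_params.url)

# The session's auth is attached to both; Basic auth is just base64, not encryption
token = with_params.headers["Authorization"].split()[1]
print(base64.b64decode(token))
```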


In [9]:
# Let's post some data
BASE_URL = "http://httpbin.org/"
payload = {'key1': 'value1', 'key2': 'value2'}

with requests.Session() as session:
    # If the endpoint expects JSON data - also sets the Content-Type header
    response1   = session.post(urljoin(BASE_URL, 'post'), json=payload)
    # But if you want to send binary data or a file or something...
    response2  = session.post(urljoin(BASE_URL, 'post'), data=b'sending you some bytes') 
    
    pprint.pprint(response2.json())

{'args': {},
 'data': 'sending you some bytes',
 'files': {},
 'form': {},
 'headers': {'Accept': '*/*',
             'Accept-Encoding': 'gzip, deflate',
             'Connection': 'close',
             'Content-Length': '22',
             'Host': 'httpbin.org',
             'User-Agent': 'python-requests/2.19.1'},
 'json': None,
 'origin': '98.196.168.3',
 'url': 'http://httpbin.org/post'}
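
The choice between `json=` and `data=` changes both the body and the `Content-Type` header. A sketch that compares the three common cases by preparing the requests without sending them:

```python
import json

import requests

URL = "http://httpbin.org/post"
payload = {"key1": "value1", "key2": "value2"}

# json= serializes the dict and sets Content-Type to application/json
as_json = requests.Request("POST", URL, json=payload).prepare()
# data= with a dict sends a form-encoded body, like an HTML form
as_form = requests.Request("POST", URL, data=payload).prepare()
# data= with bytes sends them untouched and sets no Content-Type
as_bytes = requests.Request("POST", URL, data=b"sending you some bytes").prepare()

for name, prepared in [("json", as_json), ("form", as_form), ("bytes", as_bytes)]:
    print(name, "|", prepared.headers.get("Content-Type"), "|", prepared.body)
```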


In [10]:
# A little anticlimactic, but all the other verbs too 
BASE_URL = "http://httpbin.org/"

with requests.Session() as session:
    put     = session.put(BASE_URL + 'put')
    delete  = session.delete(BASE_URL + 'delete')
    head    = session.head(BASE_URL + 'get')
    options = session.options(BASE_URL + 'get')    

# BeautifulSoup 
---
[The Docs](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)<br><br>
Python has built-in HTML and XML parsers, but I always choose BeautifulSoup when I need to extract data from websites
<br><br>
**Don't parse HTML with regex!**
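
A small sketch of why: real HTML varies in ways a simple pattern does not anticipate (attributes, tag case, whitespace). The markup and the naive pattern below are made up for illustration, and the sketch uses the built-in `html.parser` backend so it needs nothing beyond `bs4`.

```python
import re

from bs4 import BeautifulSoup

# Two paragraphs: one with an attribute, one with uppercase tags and a newline
html = '<div><p class="one">first</p><P>second\n</P></div>'

# A naive pattern only matches the exact spelling "<p>...</p>"
naive = re.findall(r"<p>(.*?)</p>", html)

# A parser normalizes the markup before you query it
soup = BeautifulSoup(html, "html.parser")
robust = [p.get_text(strip=True) for p in soup.find_all("p")]

print("regex found:", naive)
print("parser found:", robust)
```

Using a regex to match attribute values of tags the parser has already found is a different matter and works fine.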

In [11]:
from bs4 import BeautifulSoup

SAMPLE_HTML = """
<html>
    <head>
        <title>SAMPLE HTML</title>
        <script rel="stylesheet" href="/some/path/to/css" type="text/css">Some CSS</script>
        <script src="/some/path/to/javascript" charset="utf-8" type="text/javascript">Some JS</script>
    </head>
    <body>
        <p class="one" id="1">This is paragraph 1</p>
        <p class="two" id="2">This is paragraph 2</p>
        <p class="two" id="3">This is paragraph 3</p>
        <div>
            <p>This is a paragraph with no properties</p>
        </div>
    </body>
</html>
"""
soup = BeautifulSoup(SAMPLE_HTML, 'html5lib')
#or
#soup = BeautifulSoup(SAMPLE_HTML, 'lxml') if you have lxml installed

In [12]:
# Under the hood, BeautifulSoup creates a tree-like structure of the DOM
# Here are a few ways to navigate the DOM

title_tag = soup.head.title
print("Title tag: {}".format(title_tag))

Title tag: <title>SAMPLE HTML</title>


In [13]:
title_tag_text = soup.head.title.get_text()
print("Title Tag Text: {}".format(title_tag_text))

Title Tag Text: SAMPLE HTML


In [14]:
# You can also access attributes of the tags
# This gets the first occurrence!
script_tag = soup.head.script
print("Script Tag: {}".format(script_tag))

Script Tag: <script href="/some/path/to/css" rel="stylesheet" type="text/css">Some CSS</script>


In [15]:
script_tag_text = soup.head.script.get_text()
print("Script Tag Text: {}".format(script_tag_text))

Script Tag Text: Some CSS


In [16]:
script_tag_href = soup.head.script.get("href")
print("Script Tag href: {}".format(script_tag_href))

Script Tag href: /some/path/to/css


In [17]:
script_tag_rel = soup.head.script.get("rel")
print("Script Tag Rel: {}".format(script_tag_rel))

Script Tag Rel: stylesheet


In [18]:
script_tag_type = soup.head.script["type"]
print("Script Tag Type: {}".format(script_tag_type))

Script Tag Type: text/css


In [19]:
script_tag_doesnt_exist = soup.head.script.get("somePropThatDoesntExist")
print("Some Prop That Doesn't Exist: {}".format(script_tag_doesnt_exist))

Some Prop That Doesn't Exist: None


In [20]:
# I use this method A LOT
# Find all the paragraph elements (<p>)
soup.find_all('p')

[<p class="one" id="1">This is paragraph 1</p>,
 <p class="two" id="2">This is paragraph 2</p>,
 <p class="two" id="3">This is paragraph 3</p>,
 <p>This is a paragraph with no properties</p>]

In [21]:
# Or apply filters to find only elements of interest
soup.find_all('p', {"class": "two"})

[<p class="two" id="2">This is paragraph 2</p>,
 <p class="two" id="3">This is paragraph 3</p>]

In [22]:
# Or match based on regex
import re
regex = re.compile(r"\d+") # 1 or more digits (raw string avoids an invalid-escape warning)
soup.find_all("p", {"id": regex})

[<p class="one" id="1">This is paragraph 1</p>,
 <p class="two" id="2">This is paragraph 2</p>,
 <p class="two" id="3">This is paragraph 3</p>]

In [23]:
# Or elements that lack a given attribute (here: no id)
soup.find_all("p", {"id": False})

[<p>This is a paragraph with no properties</p>]

In [24]:
# Lastly there is also a .select syntax that takes CSS selectors
# I only want to find p's that are descendants of div's ("div > p" would match direct children only)
soup.select("div p")

[<p>This is a paragraph with no properties</p>]
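
A few more selectors, sketched on a made-up snippet, show how the descendant form (`div p`) differs from the direct-child form (`div > p`). The markup is an assumption for illustration, and the sketch uses the built-in `html.parser` backend.

```python
from bs4 import BeautifulSoup

html = """
<div>
  <section><p>nested</p></section>
  <p class="two">direct</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# "div p": any <p> anywhere inside a <div>
descendants = [p.get_text() for p in soup.select("div p")]
# "div > p": only <p> tags whose parent is the <div>
children = [p.get_text() for p in soup.select("div > p")]
# "p.two": <p> tags with class "two"
by_class = [p.get_text() for p in soup.select("p.two")]

print(descendants, children, by_class)
```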

In [25]:
# Similarly, find only the <script> tags that are inside <head>
soup.head.find_all("script")

[<script href="/some/path/to/css" rel="stylesheet" type="text/css">Some CSS</script>,
 <script charset="utf-8" src="/some/path/to/javascript" type="text/javascript">Some JS</script>]

In [26]:
class HTMLTable:
    """A class that will parse the first HTML Table"""
    def __init__(self, html, html_parser='html5lib'):
        self.html    = html
        self.soup    = BeautifulSoup(html, html_parser)
        self.table   = self.soup.find('table')
        self.headers = None
        self.data    = None
        
    def get_headers(self):
        """Returns the values in the thead element of the table"""
        if self.headers:
            return self.headers
        self.headers = [hdr.get_text() for hdr in self.table.thead.select('tr th')]
        return self.headers
    
    def get_data(self):
        """Returns the values in the tbody element of the table"""
        if self.data:
            return self.data
        data = []
        for tr in self.table.tbody.find_all("tr"):
            data.append([td.get_text().replace(" ", "").replace("\n", "") for td in tr.select('td')])
        self.data = data
        return self.data
    
    def write_csv(self, filename):
        """Write a csv of the parsed html table"""
        import csv
        print("Writing the table to {}".format(filename))
        with open(filename, 'w', newline='') as csvfile:  # newline='' is what the csv docs recommend
            writer = csv.writer(csvfile)
            writer.writerow(self.get_headers()) # First write out the headers
            writer.writerows(self.get_data())  # Then write out all the data
        print("Done!")
        
    def to_dataframe(self):
        """Returns a pandas dataframe of the table"""
        import pandas as pd
        return pd.DataFrame(self.get_data(), columns=self.get_headers())
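
The same idea can be tried on a tiny table without hitting a live site. This sketch is a simplified variant, not the class above. The table markup is made up, it uses the built-in `html.parser` backend, and it trims cells with `get_text(strip=True)` instead of deleting every space.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical miniature of the kind of table the class expects
TINY_TABLE = """
<table>
  <thead><tr><th>#</th><th>Name</th><th>Price</th></tr></thead>
  <tbody>
    <tr><td>1</td><td> Bitcoin </td><td>$3727.57</td></tr>
    <tr><td>2</td><td> XRP </td><td>$0.360664</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(TINY_TABLE, "html.parser")
table = soup.find("table")

# strip=True trims surrounding whitespace but keeps spaces inside the text
headers = [th.get_text(strip=True) for th in table.select("thead th")]
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.select("tbody tr")]

# Write to an in-memory buffer instead of a file for the demo
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(headers)
writer.writerows(rows)
print(buffer.getvalue())
```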

In [27]:
# Putting it all together...
BASE_URL = 'http://www.coinmarketcap.com'
# Set a custom user-agent header
HEADERS = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/66.0.3359.139 Safari/537.36'}
response = requests.get(BASE_URL, headers=HEADERS)
# Check that we got an 'ok' status
response.raise_for_status()
# Decode the raw bytes into a string (assumes the page is UTF-8; watch out for unicode!)
raw_html = response.content.decode("utf-8")

table = HTMLTable(raw_html)

In [28]:
table.get_headers()

['#',
 'Name',
 'Market Cap',
 'Price',
 'Volume (24h)',
 'Circulating Supply',
 'Change (24h)',
 'Price Graph (7d)',
 '']

In [29]:
table.get_data()[:3]

[['1',
  'BTCBitcoin',
  '$64,972,338,182',
  '$3727.57',
  '$6,163,055,961',
  '17,430,225BTC',
  '5.07%',
  '',
  'AddtoWatchlistRemovefromWatchlistWatchlistfull!ViewChartViewMarketsViewHistoricalData'],
 ['2',
  'XRPXRP',
  '$14,701,508,657',
  '$0.360664',
  '$803,839,421',
  '40,762,365,544XRP*',
  '8.38%',
  '',
  'AddtoWatchlistRemovefromWatchlistWatchlistfull!ViewChartViewMarketsViewHistoricalData'],
 ['3',
  'ETHEthereum',
  '$10,779,721,316',
  '$103.77',
  '$2,450,512,804',
  '103,878,919ETH',
  '9.35%',
  '',
  'AddtoWatchlistRemovefromWatchlistWatchlistfull!ViewChartViewMarketsViewHistoricalData']]

In [30]:
import datetime
now = datetime.datetime.now().strftime("%Y%m%d")
table.write_csv("CoinMarketCap" + now + ".csv")

Writing the table to CoinMarketCap20181218.csv
Done!


In [31]:
%%bash
TODAY=$(date +%Y%m%d)
head CoinMarketCap${TODAY}.csv | column -t -s,

#  Name            Market Cap  Price  Volume (24h)  Circulating Supply  Change (24h)  Price Graph (7d)  
1  BTCBitcoin      "$64        972    338           182"                $3727.57      "$6               163  055   961"  "17   430      225BTC"   5.07%                                                                                  AddtoWatchlistRemovefromWatchlistWatchlistfull!ViewChartViewMarketsViewHistoricalData
2  XRPXRP          "$14        701    508           657"                $0.360664     "$803             839  421"  "40   762   365      544XRP*"  8.38%                                                                                  AddtoWatchlistRemovefromWatchlistWatchlistfull!ViewChartViewMarketsViewHistoricalData
3  ETHEthereum     "$10        779    721           316"                $103.77       "$2               450  512   804"  "103  878      919ETH"   9.35%                                                                                  AddtoWatchlistRemovef
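
The columns above look scrambled because `column -s,` splits on every comma, including the ones inside quoted fields like `"$64,972,338,182"`. The file itself is fine. A CSV-aware reader handles the quoting, as this sketch shows using a shortened version of the first data row.

```python
import csv
import io

# Shortened first data row, as csv.writer would have written it
line = '1,BTCBitcoin,"$64,972,338,182",$3727.57\n'

# Splitting on commas breaks the quoted market-cap field apart
naive_fields = line.strip().split(",")

# csv.reader understands the quotes and keeps the field whole
csv_fields = next(csv.reader(io.StringIO(line)))

print(len(naive_fields), naive_fields)
print(len(csv_fields), csv_fields)
```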

In [32]:
df = table.to_dataframe()
df.head()

| | # | Name | Market Cap | Price | Volume (24h) | Circulating Supply | Change (24h) | Price Graph (7d) | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | BTCBitcoin | $64,972,338,182 | $3727.57 | $6,163,055,961 | 17,430,225BTC | 5.07% | | AddtoWatchlistRemovefromWatchlistWatchlistfull... |
| 1 | 2 | XRPXRP | $14,701,508,657 | $0.360664 | $803,839,421 | 40,762,365,544XRP* | 8.38% | | AddtoWatchlistRemovefromWatchlistWatchlistfull... |
| 2 | 3 | ETHEthereum | $10,779,721,316 | $103.77 | $2,450,512,804 | 103,878,919ETH | 9.35% | | AddtoWatchlistRemovefromWatchlistWatchlistfull... |
| 3 | 4 | EOSEOS | $2,381,335,648 | $2.63 | $1,381,745,295 | 906,245,118EOS* | 8.36% | | AddtoWatchlistRemovefromWatchlistWatchlistfull... |
| 4 | 5 | XLMStellar | $2,279,829,555 | $0.118920 | $110,773,791 | 19,171,055,281XLM* | 8.18% | | AddtoWatchlistRemovefromWatchlistWatchlistfull... |
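
Every value in the table is still a string, so one more step is needed before any analysis. This is a minimal sketch of that cleanup in plain Python. The helper names `parse_money` and `parse_percent` are made up, and the sample row reuses values from the output above.

```python
def parse_money(text):
    """Turn a string like '$64,972,338,182' into a float."""
    return float(text.replace("$", "").replace(",", ""))


def parse_percent(text):
    """Turn a string like '5.07%' into a float (percent units)."""
    return float(text.rstrip("%"))


row = {"Name": "BTCBitcoin",
       "Market Cap": "$64,972,338,182",
       "Price": "$3727.57",
       "Change (24h)": "5.07%"}

cleaned = {"Name": row["Name"],
           "Market Cap": parse_money(row["Market Cap"]),
           "Price": parse_money(row["Price"]),
           "Change (24h)": parse_percent(row["Change (24h)"])}

print(cleaned)
```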


## Any Questions?