<a href="https://github.com/theonaunheim">
    <img style="border-radius: 100%; float: right;" src="static/strawberry_thief_square.png" width=10% alt="Theo Naunheim's Github">
</a>
<br style="clear: both">
<hr>
<br>

<h1 align='center'>Web</h1>

<br>

<div style="display: table; width: 100%">
    <div style="display: table-row; width: 100%;">
        <div style="display: table-cell; width: 50%; vertical-align: middle;">
            <img src="static/internet.jpg" width="300">
        </div>
        <div style="display: table-cell; width: 10%">
        </div>
        <div style="display: table-cell; width: 40%; vertical-align: top;">
            <blockquote>
                <p style="font-style: italic;">"lo"</p>
                <br>
                <p>-The first message sent on the Internet</p>
                <br>
                <p style="font-style: italic;">"login"</p>
                <br>
                <p>-The second message sent on the Internet (one hour after the first message sent on the Internet crashed the Internet)</p>
            </blockquote>
        </div>
    </div>
</div>

<br>

<div align='left'>
    Image courtesy of <a href='https://commons.wikimedia.org/wiki/Category:Internet#/media/File:Internet_map_1024.jpg'>The Opte Project</a> under the <a href='https://creativecommons.org/licenses/by/2.5/'>CC BY 2.5</a>
</div>

<hr>

## Generally

Python cut its teeth in the internet age, and consequently both its standard library and its third-party packages have strong web capabilities. This notebook covers some of the more common client-side web operations: fetching pages, parsing HTML, querying REST APIs, and sending email.

---

# Modules covered

### Standard Library
* [email](https://docs.python.org/3.4/library/email.html#module-email)
* [json](https://docs.python.org/3/library/json.html)
* [pathlib](https://docs.python.org/3/library/pathlib.html)
* [smtplib](https://docs.python.org/3/library/smtplib.html)
* [urllib.request](https://docs.python.org/3/library/urllib.request.html#module-urllib.request)
* [urllib.parse](https://docs.python.org/3/library/urllib.parse.html)
* [webbrowser](https://docs.python.org/3/library/webbrowser.html)

### Third Party Libraries
* [beautifulsoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
* [requests](http://docs.python-requests.org/en/master/)
* [pandas](https://pandas.pydata.org/)
* [win32com.client](http://docs.activestate.com/activepython/3.3/pywin32/html/com/win32com/HTML/QuickStartClientCom.html)

# Modules not covered

### Standard Library
* [ftplib](https://docs.python.org/3/library/ftplib.html)
* [xml](https://docs.python.org/3/library/xml.html)

### Third Party Libraries
* [selenium](http://selenium-python.readthedocs.io/)

---

In [1]:
# Stdlib imports
import email.mime.text
import json
import pathlib
import smtplib
import urllib.parse
import urllib.request
import webbrowser

# Third party imports
import bs4
import pandas as pd
import requests
import win32com.client

---

# Web Requests

### Fetching web pages using the standard library

Note: some sites require authentication or other special handling that this basic example does not cover.


In [2]:
# Define a URL
HTTPS_URL = 'https://en.wikipedia.org/wiki/Monty_Python_and_the_Holy_Grail'

# Open the request
r = urllib.request.urlopen(HTTPS_URL)

# Check the status of my request
status_code = r.getcode()
print('The status of my request is {}!\n\n'.format(status_code))

# Convert the raw bytes to text
raw_data = r.read()
text_data = raw_data.decode()

# Do stuff with HTML output
print('Here is the start of our data from {} !\n\n'.format(HTTPS_URL))
print(text_data[:100])

# Close request
r.close()

The status of my request is 200!


Here is the start of our data from https://en.wikipedia.org/wiki/Monty_Python_and_the_Holy_Grail !


<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title
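
The `decode()` call above relies on Python's default of UTF-8. When a server declares a different encoding, the charset named in the `Content-Type` header can be used instead. The sketch below is only an illustration: `decode_body` is a hypothetical helper, and the Latin-1 bytes are made-up sample data standing in for `r.read()`. On a live response, `r.headers.get_content_charset()` reads the same header.

```python
import email.message


def decode_body(raw_bytes, content_type, default='utf-8'):
    """Decode bytes using the charset named in a Content-Type header value."""
    # Hypothetical helper: parse the header with the stdlib email machinery.
    header = email.message.Message()
    header['Content-Type'] = content_type
    # Fall back to the default when the header names no charset.
    charset = header.get_content_charset() or default
    return raw_bytes.decode(charset, errors='replace')


# Made-up payload standing in for r.read() on a Latin-1 page.
sample_bytes = 'café'.encode('latin-1')
print(decode_body(sample_bytes, 'text/html; charset=ISO-8859-1'))

# No charset in the header, so the UTF-8 default applies.
print(decode_body(b'plain ascii', 'text/html'))
```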


### Though it's better to do it like this with a context manager:

In [3]:
# Doing the above again as a context manager:
with urllib.request.urlopen(HTTPS_URL) as f:
    text_data = f.read().decode()
    
print('\n\n\nHere is the data again!\n\n\n{}'.format(text_data))




Here is the data again!


<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Monty Python and the Holy Grail - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Monty_Python_and_the_Holy_Grail","wgTitle":"Monty Python and the Holy Grail","wgCurRevisionId":865639304,"wgRevisionId":865639304,"wgArticleId":19701,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["EngvarB from June 2016","Use dmy dates from June 2016","Articles with hCards","All articles with unsourced statements","Articles with unsourced statements from October 2018","1975 films","English-language films","Monty Python and the Holy Grail",

### We can also download files as necessary and put them in RAM or write them to the filesystem:

In [4]:
# Define a URL
FOOT_URL = 'https://upload.wikimedia.org/wikipedia/commons/a/ab/Monty_python_foot.png'
FOOT_OUT = './static/monty_python_foot.png'

# Download data
with urllib.request.urlopen(FOOT_URL) as f:
    binary_data = f.read()

# Write data to a file.
with open(FOOT_OUT, 'wb') as f:
    f.write(binary_data)
    
# Or to a temporary buffer-like interface like io.BytesIO or tempfile.TemporaryFile
# Output using keyword args (not necessary)
args = {
    'length': len(binary_data), 
    'url': FOOT_URL, 
    'dest': FOOT_OUT
}
print('We downloaded {length} bytes of data from {url} , and wrote it to {dest}!'.format(**args))

We downloaded 209612 bytes of data from https://upload.wikimedia.org/wikipedia/commons/a/ab/Monty_python_foot.png , and wrote it to ./static/monty_python_foot.png!
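
The cell above mentions keeping the data in a buffer such as `io.BytesIO` instead of writing it to disk. A minimal sketch of that idea follows; the sample bytes (a PNG signature plus padding) are an assumption standing in for what `urlopen(...).read()` would return.

```python
import io

# Stand-in for downloaded bytes: the 8-byte PNG signature plus some padding.
PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'
binary_data = PNG_SIGNATURE + b'\x00' * 16

# Wrap the bytes in an in-memory, file-like object.
buffer = io.BytesIO(binary_data)

# It supports the usual file methods, so we can peek at the header...
header = buffer.read(8)
is_png = header == PNG_SIGNATURE

# ...and rewind before handing it to anything that expects a file.
buffer.seek(0)
total_bytes = len(buffer.getvalue())

print('Looks like a PNG: {}'.format(is_png))
print('Buffer holds {} bytes'.format(total_bytes))
```

Anything that accepts a binary file object should generally accept the buffer as well.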


### If needed you can break it into chunks:

In [5]:
# Download data using an infinite loop (generally a bad idea)
with open(FOOT_OUT, 'wb') as out_file:
    # Nested context manager
    with urllib.request.urlopen(FOOT_URL) as web_file:
        # Infinite loop
        while True:
            # Read data
            binary_data = web_file.read(8192)
            # If no data is left, read() returns empty bytes (b'') and the loop breaks.
            if not binary_data:
                break
            # Write data
            out_file.write(binary_data)
            print('.', end='')
        print('\nDing! Loops are done.')
    # Web connection closes here
# Binary file closes here

..........................
Ding! Loops are done.
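
The standard library's `shutil.copyfileobj` performs essentially the same chunked read/write loop, so the explicit `while True` can often be avoided. The sketch below uses two `io.BytesIO` objects as stand-ins (an assumption, so it runs without a network connection) for the `urlopen()` response and the output file.

```python
import io
import shutil

# Stand-ins: in the cell above these would be the urlopen() response
# and the file opened with open(FOOT_OUT, 'wb').
web_file = io.BytesIO(b'x' * 20000)
out_file = io.BytesIO()

# Copy in 8192-byte chunks until the source is exhausted.
shutil.copyfileobj(web_file, out_file, length=8192)

copied = len(out_file.getvalue())
print('Copied {} bytes'.format(copied))
```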


### Usually the "requests" module, which describes itself as "HTTP for Humans", is easier, but it can sometimes run afoul of proxy servers:

In [6]:
# Define a URL
TARGET_URL = 'https://www.google.com'

# Open the request
r = requests.get(TARGET_URL)

# Stop if error
r.raise_for_status()

# As text
text = r.text
print(text[:100])

<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content


### Once you have HTML, you can get to scraping using beautifulsoup ("bs4"):

In [7]:
# Load your soup
soup = bs4.BeautifulSoup(text_data, 'lxml')

# Iterate through all the tables to get our table
for table in soup.find_all('table'):
    # Check if it's the table we want
    if 'wikitable' in table.get('class', []):
        target_table = table

# Alternatively we can get it directly
target_table = soup.select('table.wikitable')[0]
print(str(target_table))

<table class="wikitable sortable plainrowheaders">
<tbody><tr>
<th scope="col" style="width:110px;">Actor
</th>
<th scope="col">Main role
</th>
<th class="unsortable" scope="col">Other roles (in order of appearance)
</th></tr>
<tr>
<th scope="row"><span data-sort-value="Chapman, Graham"><span class="vcard"><span class="fn"><a href="/wiki/Graham_Chapman" title="Graham Chapman">Graham Chapman</a></span></span></span>
</th>
<td><a href="/wiki/King_Arthur" title="King Arthur">Arthur, King of the Britons</a>
</td>
<td>Voice of <a href="/wiki/God" title="God">God</a> / Middle Head of Three-Headed Knight / Hiccuping Guard at Swamp Castle
</td></tr>
<tr>
<th scope="row"><span data-sort-value="Cleese, John"><span class="vcard"><span class="fn"><a href="/wiki/John_Cleese" title="John Cleese">John Cleese</a></span></span></span>
</th>
<td><a href="/wiki/Lancelot" title="Lancelot">Sir Lancelot the Brave</a>
</td>
<td>Swallow-Savvy Guard #2 / Man with "Dead" Body / <a href="/wiki/Black_Knight_(Mont

In [8]:
webbrowser.open_new('https://en.wikipedia.org/wiki/Monty_Python_and_the_Holy_Grail#Cast')

True

### And put it in a dataframe for organized manipulation:

In [9]:
# Now that we have our table, convert it to something we can use.
df = pd.read_html(str(target_table), header=0)[0]

# The table's formatting is irregular, so build the rows by hand:
# use the first <span>'s text when a cell has one, otherwise the cell's own text.
table_data = [
    [
        cell.find('span').text
        if cell.find('span')
        else cell.text
        for cell
        in row.find_all(['th', 'td'])
    ]
    for row
    in target_table.find_all('tr')
]

# Coerce to dataframe
df = pd.DataFrame.from_records(table_data[1:], columns=table_data[0])

# Do arbitrary stuff to our data.
df['FIRST_NAME'] = df['Actor\n'].str.split().str.get(0)
df['LAST_NAME'] = df['Actor\n'].str.split().str.get(1)
df.head()

Unnamed: 0,Actor,Main role,Other roles (in order of appearance),FIRST_NAME,LAST_NAME
0,Graham Chapman,"Arthur, King of the Britons\n",Voice of God / Middle Head of Three-Headed Kni...,Graham,Chapman
1,John Cleese,Sir Lancelot the Brave\n,"Swallow-Savvy Guard #2 / Man with ""Dead"" Body ...",John,Cleese
2,Terry Gilliam,"Patsy, Arthur's Servant\n",Green Knight / Singing Camelot Knight #3 / Gor...,Terry,Gilliam
3,Eric Idle,Sir Robin the Not-Quite-So-Brave-as-Sir-Lancel...,Dead Collector / Witch-Hunting Villager #1 / S...,Eric,Idle
4,Terry Jones,Sir Bedevere the Wise\n,Dennis' Mother / French Knight / Left Head of ...,Terry,Jones


### Or if we want to crawl, we can get the links and go from there.

In [10]:
# We can manually put in the base
base = 'https://en.wikipedia.org'

# Or we can get it more generally
url_obj = urllib.parse.urlsplit(HTTPS_URL)
base = url_obj.scheme + '://' + url_obj.hostname

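# Get all links, resolving each href against the base.
links = []
for link in soup.find_all('a'):
    try:
        href = link.attrs['href']
    except KeyError:
        # Anchors without an href have nothing to follow.
        continue
    # urljoin leaves absolute URLs alone and prepends the base to relative ones.
    links.append(urllib.parse.urljoin(base, href))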

links[:5]

['https://en.wikipedia.org#mw-head',
 'https://en.wikipedia.org#p-search',
 'https://en.wikipedia.org/wiki/File:Monty-Python-1975-poster.png',
 'https://en.wikipedia.org/wiki/Terry_Gilliam',
 'https://en.wikipedia.org/wiki/Terry_Jones']
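
When crawling, hrefs come in several shapes: site-relative paths, bare fragments, protocol-relative URLs, and full absolute URLs. One way to resolve all of them is `urllib.parse.urljoin`. The sketch below uses made-up hrefs (the `example.com` and `x.png` targets are placeholders, not links taken from the page).

```python
import urllib.parse

PAGE_URL = 'https://en.wikipedia.org/wiki/Monty_Python_and_the_Holy_Grail'

# Made-up hrefs covering the common cases.
hrefs = [
    '/wiki/Terry_Jones',              # site-relative path
    '#Cast',                          # fragment on the current page
    '//upload.wikimedia.org/x.png',   # protocol-relative URL
    'https://example.com/page',       # already absolute
]

# Resolve each href against the page it was found on.
resolved = [urllib.parse.urljoin(PAGE_URL, href) for href in hrefs]

for url in resolved:
    print(url)
```

Joining against the full page URL rather than just the scheme and host lets fragment-only links resolve back to the current page.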

### Where this really comes into its own is with REST APIs to fetch structured data from the internets:

In [11]:
CFPB_ENDPOINT = 'https://data.consumerfinance.gov/resource/jhzv-w97w.json?'

QUERY_ARGS = {
    'state'   : 'MO',
    'product' : 'Credit card'
}

# Let's construct our REST query string
query_string = urllib.parse.urlencode(QUERY_ARGS)
full_url = CFPB_ENDPOINT + query_string

print('We are querying: {} !\n\n'.format(full_url))

# And then fetch our data
with urllib.request.urlopen(full_url) as f:
    json_data = f.read().decode()

# We can load this into native Python datatypes
data = json.loads(json_data)
print(data[0])
print('\n\n')

# Or we can immediately go to pandas.
df = pd.read_json(json_data)

# And refine as needed.
df.dropna(subset=['complaint_what_happened']).head(3)

We are querying: https://data.consumerfinance.gov/resource/jhzv-w97w.json?state=MO&product=Credit+card !


{'company': 'DISCOVER BANK', 'company_response': 'Closed with monetary relief', 'complaint_id': '192017', 'consumer_consent_provided': 'N/A', 'consumer_disputed': 'No', 'date_received': '2012-11-15T00:00:00.000', 'date_sent_to_company': '2012-11-16T00:00:00.000', 'issue': 'Other fee', 'product': 'Credit card', 'state': 'MO', 'submitted_via': 'Postal mail', 'tags': 'Servicemember', 'timely': 'Yes', 'zip_code': '63135'}





Unnamed: 0,company,company_public_response,company_response,complaint_id,complaint_what_happened,consumer_consent_provided,consumer_disputed,date_received,date_sent_to_company,issue,product,state,submitted_via,tags,timely,zip_code
822,WELLS FARGO & COMPANY,Company chooses not to provide a public response,Closed with explanation,1560790,I applied for a Dillard 's charge card as I wo...,Consent provided,Yes,2015-09-11T00:00:00.000,2015-09-11T00:00:00.000,Unsolicited issuance of credit card,Credit card,MO,Web,Older American,Yes,630XX
823,BARCLAYS BANK DELAWARE,Company has responded to the consumer and the ...,Closed with explanation,2250674,Thank you for being there. I called this after...,Consent provided,No,2016-12-15T00:00:00.000,2017-01-18T00:00:00.000,Billing disputes,Credit card,MO,Web,,Yes,63401
824,"CITIBANK, N.A.",Company chooses not to provide a public response,Closed with non-monetary relief,1585513,I have a revolving credit card with Citibank. ...,Consent provided,No,2015-09-29T00:00:00.000,2015-09-29T00:00:00.000,APR or interest rate,Credit card,MO,Web,Older American,Yes,641XX
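
The query string above was produced by `urllib.parse.urlencode`, which escapes characters that are not safe in a URL. A small offline sketch of what it does with the same arguments, along with the `quote_via` option and the reverse operation `parse_qs`:

```python
import urllib.parse

QUERY_ARGS = {
    'state': 'MO',
    'product': 'Credit card',
}

# Default behavior: spaces become '+'.
query_string = urllib.parse.urlencode(QUERY_ARGS)
print(query_string)

# With quote_via=quote, spaces become '%20' instead.
percent_encoded = urllib.parse.urlencode(QUERY_ARGS, quote_via=urllib.parse.quote)
print(percent_encoded)

# parse_qs reverses the encoding; each key maps to a list of values.
round_trip = urllib.parse.parse_qs(query_string)
print(round_trip)
```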


### And again:

In [12]:
# Base URL
FEMA_URL = 'http://www.fema.gov/api/open/v1/DisasterDeclarationsSummaries?$filter=declarationDate%20gt%20\'{}\''

# Let's construct our REST query string
now = pd.Timestamp.now()
# We can use all the handy libraries of Python!
one_hundred_bdays_ago = now - pd.tseries.offsets.BusinessDay(100)
date = one_hundred_bdays_ago.date()
iso_date = date.isoformat()
full_url = FEMA_URL.format(iso_date)

print('Querying {} !'.format(full_url))

# And then fetch our data
with urllib.request.urlopen(full_url) as f:
    json_data = f.read().decode()
    data = json.loads(json_data)

df = pd.DataFrame(data['DisasterDeclarationsSummaries'])
df.head(5)

Querying http://www.fema.gov/api/open/v1/DisasterDeclarationsSummaries?$filter=declarationDate%20gt%20'2018-06-11' !


Unnamed: 0,declarationDate,declaredCountyArea,disasterNumber,disasterType,fyDeclared,hash,hmProgramDeclared,iaProgramDeclared,id,ihProgramDeclared,incidentBeginDate,incidentEndDate,incidentType,lastRefresh,paProgramDeclared,placeCode,state,title
0,2018-06-11T18:28:00.000Z,Albany (County),5241,FM,2018,58a181563edc5de162e689cef7f78a1e,True,False,5b290d8d115d452fde1036bd,False,2018-06-11T12:00:00.000Z,,Fire,2018-07-16T15:10:27.612Z,True,99001.0,WY,BADGER CREEK FIRE
1,2018-06-17T22:19:00.000Z,Lyon (County),5242,FM,2018,3fae9360ba9e14faf76358d9cc9ee31b,True,False,5b2acb4d115d452fde150f54,False,2018-06-17T13:00:00.000Z,,Fire,2018-07-16T15:10:27.621Z,True,99019.0,NV,UPPER COLONY FIRE
2,2018-06-22T00:02:00.000Z,Jefferson (County),5243,FM,2018,10ed28290b622196c71a336f904d1db9,True,False,5b2d5e93115d452fde1a1986,False,2018-06-21T18:30:00.000Z,2018-06-25T18:00:00.000Z,Fire,2018-08-01T13:10:02.496Z,True,99031.0,OR,GRAHAM FIRE
3,2018-06-24T01:56:00.000Z,Lake (County),5244,FM,2018,f713a42a0d6917057be8e9a8e934959d,True,False,5b329f77115d452fde2437b1,False,2018-06-24T01:56:00.000Z,2018-07-06T23:59:00.000Z,Fire,2018-07-16T16:26:40.115Z,True,99033.0,CA,PAWNEE FIRE
4,2018-06-25T00:05:00.000Z,Shasta (County),5245,FM,2018,7245e74af0f6eacfbc324b70bc9ea671,True,False,5b329f77115d452fde2437b0,False,2018-06-24T21:15:00.000Z,2018-06-29T23:59:00.000Z,Fire,2018-09-12T14:24:07.453Z,True,99089.0,CA,CREEK FIRE
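
The `%20` escapes in the FEMA URL were typed by hand. As an alternative, `urllib.parse.quote` can escape the filter expression for us. A sketch, assuming a hypothetical helper named `build_filter_url`; note that `quote` also escapes the apostrophes as `%27`, which decodes to the same expression.

```python
import urllib.parse

# Endpoint copied from the cell above, without the query string.
FEMA_BASE = 'http://www.fema.gov/api/open/v1/DisasterDeclarationsSummaries'


def build_filter_url(iso_date):
    """Hypothetical helper: build the $filter query for a given ISO date."""
    # Write the filter in readable form, then let quote() escape it.
    filter_expr = "declarationDate gt '{}'".format(iso_date)
    return FEMA_BASE + '?$filter=' + urllib.parse.quote(filter_expr)


full_url = build_filter_url('2018-06-11')
print(full_url)
```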


### We can also email stuff.

Note: the method below requires MS Outlook and the pywin32 package (which provides `win32com`), so it only works on Windows.

In [13]:
# We can also email stuff
STRING_OF_RECIPIENTS = '''
    <theo.naunheim@gmail.com>;
'''

# Easy way to extract emails from formatted string
recipient_series = pd.Series([STRING_OF_RECIPIENTS])
extracted_emails = recipient_series.str.extractall("<(.*?)>")
email_addresses = extracted_emails[0].values

# Render the first rows of the table as HTML
table_html = df.head(5)[['declarationDate', 'declaredCountyArea']].to_html()

EMAIL_HTML = '''

<head>

    <style type="text/css">
    
        body, table, td {{font-family: Segoe UI, sans-serif !important; color: #34282C;}}
        table {{border-width: 20px; width: 100%;}}
        th, td {{text-align: left; padding: 8px; border: 1px solid white; border-collapse: collapse; }}
        th {{background-color: #000080; color: white;}}

    </style>

</head>

<body>

    <p>
        <h1>Automated email sent from Python automation presentation.</h1><br>
        <span>
            <img src="cid:{img_path}" width="100" alt="foot">
            <img src="cid:{img_path}" width="100" alt="foot">
        </span>
        <br>
        I met a traveller from an antique land<br>
        Who said: Two vast and trunkless legs of stone<br>
        Stand in the desert ... near them, on the sand,<br>
        Half sunk, a shattered visage lies, whose frown,<br>
        And wrinkled lip, and sneer of cold command,<br>
        Tell that its sculptor well those passions read<br>
        Which yet survive, stamped on these lifeless things,<br>
        The hand that mocked them and the heart that fed;<br>
        <br>
        And on the pedestal these words appear:<br>
        'My name is Ozymandias, king of kings;<br>
        Look on my works, ye Mighty, and despair!'<br>
        Nothing beside remains. Round the decay<br>
        Of that colossal wreck, boundless and bare<br>
        The lone and level sands stretch far away.<br>
        <br>
    </p>

    {table_content}

</body>

'''.format(
    table_content=table_html,
    img_path=str(pathlib.Path(FOOT_OUT).name)
)

try:
    # Create an application object
    outlook = win32com.client.Dispatch("Outlook.Application")

    # Create your email
    mail = outlook.CreateItem(0)
    mail.To = ';'.join(email_addresses)
    mail.Subject = 'Ozymandias'

    # Add attachments.
    iris_csv = pathlib.Path(FOOT_OUT).parent.parent.absolute() / 'data' / 'iris_dataset.csv'
    mail.Attachments.Add(str(iris_csv))
    foot_attach = mail.Attachments.Add(str(pathlib.Path(FOOT_OUT).absolute()))
    
    # Set CID of foot attachment
    foot_attach.PropertyAccessor.SetProperty(
        "http://schemas.microsoft.com/mapi/proptag/0x3712001E", 
        str(pathlib.Path(FOOT_OUT).name)
    )
    
    # Add body
    mail.HTMLBody = EMAIL_HTML
    
    # mail.CC = 'recipient1@gmail.com; recipient2@gmail.com'
    # mail.BlindCopyTo = "alice_bob@gmail.com"

    # Send your mail
    mail.Send()
    # outlook.Quit() # Optional: you will probably want to keep Outlook open.
except Exception as e:
    raise Exception("Something went wrong, but I have no idea what is happening on your particular machine: " + str(e)) from e
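
The recipient extraction at the top of the cell above uses pandas, but the same pattern works with the standard library's `re` module. A minimal sketch; the names and `example.com` addresses are made up for illustration.

```python
import re

# Made-up recipients in the "Name <address>;" style that Outlook produces.
STRING_OF_RECIPIENTS = '''
    Alice Example <alice@example.com>;
    Bob Example <bob@example.com>;
'''

# Non-greedy match of everything between each pair of angle brackets.
email_addresses = re.findall(r'<(.*?)>', STRING_OF_RECIPIENTS)
print(email_addresses)

# Outlook's "To" field expects a semicolon-separated string.
to_field = ';'.join(email_addresses)
print(to_field)
```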

### It's generally preferable to send it directly via server, but that takes some prep:

In [14]:
# Construct your email message
# msg = email.mime.text.MIMEText(EMAIL_HTML, 'html')
# msg['Subject'] = 'Ozymandias'
# msg['From'] = 'Percy Bysshe Shelley'
# msg['To'] = ','.join(email_addresses)

# Send msg (this is the test server below, your server may differ).
# sendmail() wants a list of addresses, not one comma-joined string.
# s = smtplib.SMTP(host='server', port=25)
# s.sendmail(msg['From'], list(email_addresses), msg.as_string())
# s.quit()
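
The message itself can be built and inspected without any mail server. The sketch below uses `email.message.EmailMessage`, a newer standard library interface than the `MIMEText` class in the cell above, and was swapped in here for illustration. The addresses, body text and the server named in the final comment are all placeholders.

```python
from email.message import EmailMessage

# Placeholder addresses; substitute real ones before sending.
recipients = ['alice@example.com', 'bob@example.com']

msg = EmailMessage()
msg['Subject'] = 'Ozymandias'
msg['From'] = 'sender@example.com'
msg['To'] = ', '.join(recipients)

# Plain-text body first, then an HTML alternative for clients that render it.
msg.set_content('I met a traveller from an antique land')
msg.add_alternative('<p>I met a traveller from an antique land</p>', subtype='html')

content_type = msg.get_content_type()
print(content_type)
print(msg['To'])

# Sending would then look something like this (server name is a placeholder):
# with smtplib.SMTP(host='server', port=25) as s:
#     s.send_message(msg)
```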

# Additional Learning Resources

* ### [Webscraping on Analytics Vidhya](https://www.analyticsvidhya.com/blog/2015/10/beginner-guide-web-scraping-beautiful-soup-python/)
* ### [Webscraping on Hitchhiker's Guide to Python](http://docs.python-guide.org/en/latest/scenarios/scrape/)
* ### [Email Examples in Python Documentation](https://docs.python.org/3.4/library/email-examples.html)

---

# Next Up: [Database](3_database.ipynb)

<br>

<img style="margin-left: 0;" src="static/database.png" width="20%">

<br>

<div align='left'>
    Image courtesy of <a href='https://commons.wikimedia.org/wiki/File:Applications-database.svg'>Dracos</a> under the <a href='https://creativecommons.org/licenses/by-sa/3.0/deed.en'>CC BY-SA 3.0</a>
</div>

---