# The urllib module

The urllib module connects Python to the web: it can open URLs, send requests, and build or take apart URL strings.

### urllib.request example

In Python 2.7 you import urllib or import urllib2

In Python 3.x you import urllib.request

In [8]:
import urllib.request

Open the URL. urlopen sends the request and returns a response object; the page body is not loaded until we read it (lazy)

In [9]:
x = urllib.request.urlopen("https://www.google.com")

The .read() method returns everything at the opened URL (here, the page's HTML) as a bytes object, which is why the output starts with b'

In [10]:
x.read()

b'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content="Search the world\'s information, including webpages, images, videos and more. Google has many special features to help you find exactly what you\'re looking for." name="description"><meta content="noodp" name="robots"><meta content="/images/google_favicon_128.png" itemprop="image"><title>Google</title><script>(function(){window.google={kEI:\'1fR_VLnzPNOyoQT4lYCYCg\',kEXPI:\'4011009,4016824,4020562,4021586,4021598,4022495,4023367,4023567,4024625,4024970,4024978,4025285,4025769,4025827,4026224,8300096,8500393,8500852,10200083,10200716,10200850\',authuser:0,kSID:\'1fR_VLnzPNOyoQT4lYCYCg\'};google.kHL=\'en\';})();(function(){google.lc=[];google.li=0;google.getEI=function(a){for(var b;a&&(!a.getAttribute||!(b=a.getAttribute("eid")));)a=a.parentNode;return b||google.kEI};google.https=function(){return"https:"==window.location.protocol};google.ml=function(){};google.time=function(){return(n
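Since .read() hands back bytes, a common next step is decoding them into text. The sketch below is one way to do that, not the only one: fetch_text is a hypothetical helper name, the utf-8 fallback is an assumption, and the demo uses a data: URL (which urlopen also understands) purely so it runs without a network connection.

```python
import urllib.request


def fetch_text(url, fallback_encoding="utf-8"):
    """Open url, read the whole body, and return it decoded as str.

    fetch_text is a hypothetical helper for illustration, not part of urllib.
    """
    # the with-block closes the response once we are done reading
    with urllib.request.urlopen(url) as resp:
        raw = resp.read()  # bytes, like the b'...' output above
        # use the charset from the Content-Type header if one was sent
        charset = resp.headers.get_content_charset() or fallback_encoding
    return raw.decode(charset, errors="replace")


# offline demo: a data: URL stands in for a real page (assumption for the example)
text = fetch_text("data:text/plain;charset=utf-8,hello%20web")
print(text)  # hello web
```

With a real site you would pass its address instead, for example fetch_text("https://www.google.com").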

In total:
>   1. We passed in a URL
>   2. Python opened it
>   3. We asked for some data
>   4. We read the results

The urllib.parse example further down does the same thing in long form, building the request by hand

### Anatomy of a url

http://www.pythonprogramming.net/?s=basic&submit=Search

Breaking down the above link:
    
    http://www.pythonprogramming.net/ -- scheme, host and path (the website)
    
    ?s=basic&submit=Search -- query string

    s, submit -- variables (the first variable in the query string
                follows the ?)

    & -- joins the variables of a GET request (s=basic&submit=Search)

Therefore, the query string: 

    ?s=basic&submit=Search

represents a GET request to the website above
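The same breakdown can be done by Python itself. This is a small sketch using urlsplit and parse_qs from urllib.parse; the variable names link, parts and variables are just for illustration.

```python
from urllib.parse import urlsplit, parse_qs

link = "http://www.pythonprogramming.net/?s=basic&submit=Search"

parts = urlsplit(link)
print(parts.scheme)   # http
print(parts.netloc)   # www.pythonprogramming.net
print(parts.path)     # /
print(parts.query)    # s=basic&submit=Search

# parse_qs turns the query string into a dict; each value is a list,
# because a variable may appear more than once in a URL
variables = parse_qs(parts.query)
print(variables)      # {'s': ['basic'], 'submit': ['Search']}
```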

### urllib.parse example

In [15]:
import urllib.parse

In [22]:
# base url we want to visit
url = 'http://pythonprogramming.net'

Here we use Python to encode the query string below:

    ?s=basic&submit=Search

In [23]:
# dictionary where keys = variable names and values = their values
# we use this dict to encode the values for use in a url

values = {'s':'basic', 'submit':'Search'}
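To see what a dictionary like this turns into, here is a short offline sketch of urllib.parse.urlencode. The extra dictionaries with a space and with reserved characters are made-up inputs for illustration, and quote is shown only for contrast.

```python
from urllib.parse import urlencode, quote

values = {'s': 'basic', 'submit': 'Search'}
print(urlencode(values))                  # s=basic&submit=Search

# spaces and reserved characters are escaped for us
print(urlencode({'s': 'basic python'}))   # s=basic+python
print(urlencode({'s': 'a&b=c'}))          # s=a%26b%3Dc

# quote() is the variant that writes a space as %20
print(quote('basic python'))              # basic%20python

# urlencode does not add the leading ?, so we join it to the base url ourselves
full_url = 'http://pythonprogramming.net/?' + urlencode(values)
print(full_url)   # http://pythonprogramming.net/?s=basic&submit=Search
```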

In [28]:
# .urlencode() encodes the values as they should appear in a url (url format):
# pairs come out as key=value joined by &, and a space becomes + (quote() gives %20)
# it does not add the leading ?, so from our dict we get s=basic&submit=Search

data = urllib.parse.urlencode(values)

# next we turn the string into bytes in utf-8 using the .encode() method

data = data.encode('utf-8')

# now we have the data we want to send in the correct encoding

# here we build a request; because we pass data, urllib sends it as a POST
req = urllib.request.Request(url, data)

# now we visit the url 
resp = urllib.request.urlopen(req)

# and finally read the results of the request
respData = resp.read()

# So in sum: we visited a url, passed in some data, and read the results
print(respData)

b'<!DOCTYPE html>\n<html itemscope="itemscope" itemtype="http://schema.org/SearchResultsPage" lang="en-US" prefix="og: http://ogp.me/ns#">\n<head>\n\t<meta charset="UTF-8" />\n\t<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1" />\n\t<title>Search for &quot;basic&quot; - Python Programming</title>\n\t<link rel="profile" href="http://gmpg.org/xfn/11" />\n\t<link rel="pingback" href="http://pythonprogramming.net/xmlrpc.php" />\n\t<!--[if lt IE 9]>\n\t<script src="http://pythonprogramming.net/wp-content/themes/independent-publisher/js/html5.js" type="text/javascript"></script>\n\t<![endif]-->\n\t\n<!-- This site is optimized with the Yoast WordPress SEO plugin v1.5.3.3 - https://yoast.com/wordpress/plugins/seo/ -->\n<meta name="robots" content="noindex,follow"/>\n<link rel="canonical" href="http://pythonprogramming.net/search/basic/" />\n<link rel="next" href="http://pythonprogramming.net/search/basic/page/2/" />\n<meta property="og:locale" content="en_U

Yay! We got a page of HTML back, same as above, for lots more work!
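One detail worth checking: passing data to Request makes urllib send a POST, while putting the variables in the URL gives a GET. The sketch below builds both kinds of request and inspects them without contacting the site; post_req and get_req are names chosen for this example.

```python
import urllib.parse
import urllib.request

url = 'http://pythonprogramming.net'
values = {'s': 'basic', 'submit': 'Search'}
query = urllib.parse.urlencode(values)

# data passed as bytes -> urllib will send a POST with the values in the body
post_req = urllib.request.Request(url, query.encode('utf-8'))
print(post_req.get_method())   # POST
print(post_req.data)           # b's=basic&submit=Search'

# values appended to the url and no data -> a plain GET
get_req = urllib.request.Request(url + '/?' + query)
print(get_req.get_method())    # GET
print(get_req.full_url)        # http://pythonprogramming.net/?s=basic&submit=Search
```

Either request object can then be handed to urllib.request.urlopen to actually send it.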

### Problem: Some websites don't like programs accessing the data

Most likely this means they offer an API for accessing the data instead

Examples: Google, Twitter, etc.

In [27]:
# here we try to request a google search for test
try:
    x = urllib.request.urlopen('https://www.google.com/search?q=test')
    print(x.read())
    
except Exception as e:
    print(str(e))

HTTP Error 403: Forbidden
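Catching Exception works, but urllib has its own error classes in urllib.error that tell you more. Below is a minimal sketch: describe_failure and try_open are hypothetical helper names, and the 403 error is constructed by hand so the example runs offline.

```python
import urllib.error
import urllib.request
from email.message import Message


def describe_failure(exc):
    """Return a short message for an exception raised by urlopen.

    describe_failure is a hypothetical helper for illustration.
    """
    # HTTPError is a subclass of URLError, so it has to be checked first
    if isinstance(exc, urllib.error.HTTPError):
        return "server answered with status {}: {}".format(exc.code, exc.reason)
    if isinstance(exc, urllib.error.URLError):
        return "could not open the url: {}".format(exc.reason)
    return "unexpected error: {}".format(exc)


def try_open(url):
    """Open url and return the body, or print why it failed and return None."""
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.read()
    except urllib.error.URLError as e:   # also catches HTTPError
        print(describe_failure(e))
        return None


# offline demo: build the same kind of error the Google search produced
forbidden = urllib.error.HTTPError(
    'https://www.google.com/search?q=test', 403, 'Forbidden', Message(), None)
print(str(forbidden))               # HTTP Error 403: Forbidden
print(describe_failure(forbidden))  # server answered with status 403: Forbidden
```

Having the status code in e.code makes it possible to react differently to, say, a 403 and a 404.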


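Google blocks us. By default urllib announces itself with a User-Agent like Python-urllib/3.x, which is easy for a site to spot. So how do we get around this?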

In [29]:
try:
    url = 'https://www.google.com/search?q=test'
    
    # headers are extra data we send to a website with each request, its info on you!
    headers = {}
    # User-Agent names the browser (or program) we are using
    headers['User-Agent'] = "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17"
    # the search term is already hardcoded into the url, so we pass no data
    req = urllib.request.Request(url, headers=headers)
    resp = urllib.request.urlopen(req)
    # This is a ton of data, so we wont be printing
    respData = resp.read()
    
    # respData is bytes, so we open the file in binary mode
    with open('withHeaders.txt', 'wb') as saveFile:
        saveFile.write(respData)
    
except Exception as e:
    print(str(e))

The headers only help if they are attached to the request that is actually sent. The run recorded for this notebook called urlopen on the bare search URL at the top of the try block, so the error was raised before the headers were ever used, and it printed:

HTTP Error 403: Forbidden

Even with the headers attached, Google may still refuse the request; it has gotten smarter since 7/19/2014 :(

It worked for sentdex in this video:

https://www.youtube.com/watch?v=5GzVNi0oTxQ&index=36&list=PLQVvvaa0QuDe8XSftW-RAxdo6OmaeL85M
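
To check which headers a request will carry without sending anything, the Request object can be inspected directly. This is a rough sketch: the User-Agent string is shortened for the example, and the exact default value printed at the end depends on your Python version.

```python
import urllib.request

url = 'https://www.google.com/search?q=test'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux i686)'}  # shortened for the example

req = urllib.request.Request(url, headers=headers)

# Request stores header names capitalized as 'User-agent'
print(req.has_header('User-agent'))   # True
print(req.get_header('User-agent'))   # Mozilla/5.0 (X11; Linux i686)
print(req.header_items())             # [('User-agent', 'Mozilla/5.0 (X11; Linux i686)')]

# without our header, the opener falls back to its default User-agent
opener = urllib.request.build_opener()
print(opener.addheaders)   # e.g. [('User-agent', 'Python-urllib/3.x')]
```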