urllib is the URL handling package in Python's standard library. It is used to fetch URLs (Uniform Resource Locators): its urlopen function can retrieve them over several protocols, including HTTP, HTTPS, FTP and local files.

The package collects several modules for working with URLs:

    urllib.request for opening and reading URLs
    urllib.parse for parsing URLs
    urllib.error for the exceptions raised by urllib.request
    urllib.robotparser for parsing robots.txt files


urllib.request

This module defines functions and classes for opening URLs (mostly HTTP). The simplest way to open a URL is:

urllib.request.urlopen(url)

In [11]:

import urllib.request

# urlopen() returns a response object; read() returns the body as bytes
response = urllib.request.urlopen('https://www.geeksforgeeks.org/')
print(response.read())

b'<!DOCTYPE html>\r\n<!--[if IE 7]>\r\n<html class="ie ie7" lang="en-US" prefix="og: http://ogp.me/ns#">\r\n<![endif]-->\r\n<!--[if IE 8]>\r\n<html class="ie ie8" lang="en-US" prefix="og: http://ogp.me/ns#">\r\n<![endif]-->\r\n<!--[if !(IE 7) | !(IE 8)  ]><!-->\r\n<html lang="en-US" prefix="og: http://ogp.me/ns#" >\r\n\r\n<!--<![endif]-->\r\n<head>\r\n<meta charset="UTF-8" />\r\n<meta name="viewport" content="width=device-width" />\r\n<meta name="description" content="A Computer Science portal for geeks. It contains well written, well thought and well explained computer science and programming articles, quizzes and practice/competitive programming/company interview Questions.">\r\n<link rel="shortcut icon" href="https://www.geeksforgeeks.org/favicon.ico" type="image/x-icon" />\r\n\r\n<meta name="theme-color" content="#0f9d58" />\r\n\r\n<meta property="og:image" content="https://www.geeksforgeeks.org/wp-content/uploads/gfg_200X200.png">\r\n<meta property="og:image:type" content="image/p
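The output above is a bytes object, which is why it starts with b'. A minimal sketch of turning the body into text follows; the helper name fetch_text, the timeout value and the fallback to UTF-8 are assumptions made for illustration, and a data: URL is used so that the example runs without network access.

```python
from urllib.request import urlopen


def fetch_text(url, timeout=10):
    """Fetch a URL and return its body decoded to str (hypothetical helper)."""
    # The with-statement closes the response once the body has been read.
    with urlopen(url, timeout=timeout) as response:
        # Use the charset announced by the server, falling back to UTF-8
        # (the fallback is an assumption, not something urllib does itself).
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# A data: URL carries its own content, so no network connection is needed.
sample = fetch_text("data:text/plain;charset=utf-8,Hello%20urllib")
print(sample)
```

The same helper should work for an http or https URL, in which case the timeout limits how long the call may block.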

urllib.parse

This module defines functions for manipulating URLs and their component parts. It mainly deals with splitting a URL into its components and with joining components back into a URL string.

In [13]:

from urllib.parse import urlparse, urlunparse

# split the URL into its components
parse_url = urlparse('https://www.geeksforgeeks.org/python-langtons-ant/')
print(parse_url)
print("\n")
# join the components back into a URL string
unparse_url = urlunparse(parse_url)
print(unparse_url)


ParseResult(scheme='https', netloc='www.geeksforgeeks.org', path='/python-langtons-ant/', params='', query='', fragment='')


https://www.geeksforgeeks.org/python-langtons-ant/


Other functions in urllib.parse:
Function 	Use
urllib.parse.urlparse 	Splits a URL into its six components
urllib.parse.urlunparse 	Joins the six components back into a URL
urllib.parse.urlsplit 	Similar to urlparse(), but does not split the params from the path
urllib.parse.urlunsplit 	Combines the tuple returned by urlsplit() into a URL
urllib.parse.urldefrag 	Returns the URL without its fragment, together with the fragment (empty if there is none)
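The functions in the table can be compared on a single URL. The sketch below uses a made-up example.com address; parse_qs is not listed in the table but is another urllib.parse function, used here to turn the query string into a dictionary.

```python
from urllib.parse import parse_qs, urldefrag, urlparse, urlsplit, urlunsplit

# Hypothetical URL with params (;v=1), a query string and a fragment.
url = "https://www.example.com/articles/list;v=1?tag=python&page=2#comments"

# urlparse() separates the params from the path ...
parsed = urlparse(url)
print(parsed.path, parsed.params)

# ... while urlsplit() leaves them attached to the path.
split = urlsplit(url)
print(split.path)

# urlunsplit() reverses urlsplit().
rebuilt = urlunsplit(split)
print(rebuilt == url)

# urldefrag() returns the URL without the fragment, plus the fragment.
clean_url, fragment = urldefrag(url)
print(clean_url, fragment)

# parse_qs() maps each query key to a list of its values.
query = parse_qs(split.query)
print(query)
```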

urllib.error

This module defines the exception classes raised by urllib.request. Whenever fetching a URL fails, one of the following exceptions is raised:

    URLError: raised for errors in the URL itself or for failures while fetching it, such as missing connectivity. Its 'reason' property tells the user why the error occurred.
    HTTPError: raised when the server answers with an HTTP error status, for example when authentication is required. It is a subclass of URLError and carries the status in its 'code' property. Typical errors include '404' (page not found), '403' (request forbidden) and '401' (authentication required).


In [None]:
# URL Error
import urllib.error
import urllib.request

# trying to read the URL but with no internet connectivity
try:
    x = urllib.request.urlopen('https://www.engadget.com')
    print(x.read())

# catch the URLError raised when the server cannot be reached
except urllib.error.URLError as e:
    print(e.reason)


In [19]:

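# HTTP Error

import urllib.error
import urllib.request

# trying to read a URL with a malformed query string (note the spaces);
# newer Python versions may refuse such a URL before it is sent
try:
    x = urllib.request.urlopen('https://www.google.com/search?q = test')
    print(x.read())
# catch the HTTPError raised for the error status
except urllib.error.HTTPError as e:
    print(e)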


HTTP Error 400: Bad Request
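Because HTTPError is a subclass of URLError, code that wants to tell the two apart has to check for HTTPError first. The sketch below shows that ordering without touching the network: the helper describe_failure, the example.com URL and the error messages are made up for illustration, and the exceptions are built by hand rather than raised by urlopen().

```python
import io
import urllib.error


def describe_failure(exc):
    """Summarise an exception from urllib.request (hypothetical helper)."""
    # Check the subclass first; every HTTPError is also a URLError.
    if isinstance(exc, urllib.error.HTTPError):
        return f"server answered with HTTP {exc.code}: {exc.reason}"
    if isinstance(exc, urllib.error.URLError):
        return f"could not reach the server: {exc.reason}"
    return f"unexpected error: {exc}"


# Hand-built exceptions standing in for what urlopen() would raise.
http_failure = urllib.error.HTTPError(
    "https://www.example.com/missing", 404, "Not Found", None, io.BytesIO()
)
url_failure = urllib.error.URLError("no route to host")

print(describe_failure(http_failure))
print(describe_failure(url_failure))
```

In a try/except around urlopen(), the same rule applies: the except clause for HTTPError has to come before the one for URLError.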


urllib.robotparser

This module contains a single class, RobotFileParser. It answers the question of whether a particular user agent may fetch a URL on a website that publishes a robots.txt file. robots.txt is a text file that webmasters create to instruct web robots how to crawl the pages of their website; it tells a web scraper which parts of the server should not be accessed.

In [18]:
# importing the robot parser module
import urllib.robotparser as rb

bot = rb.RobotFileParser()

# set_url() records where the site's robots.txt file lives; it returns None
x = bot.set_url('https://www.geeksforgeeks.org/robots.txt')
print(x)

# read() downloads and parses the file; it also returns None
y = bot.read()
print(y)

# can_fetch() reports whether the given user agent may fetch the URL
z = bot.can_fetch('*', 'https://www.geeksforgeeks.org/')
print(z)

# the answer for an admin path depends on the rules the site publishes
w = bot.can_fetch('*', 'https://www.geeksforgeeks.org/wp-admin/')
print(w)


None
None
True
True
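The effect of individual rules is easier to see when the robots.txt content is supplied directly. The sketch below feeds a small, made-up rule set to RobotFileParser.parse() instead of downloading a file; the rules and the example.com URLs are assumptions chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: block /wp-admin/, allow everything else.
rules = """\
User-agent: *
Disallow: /wp-admin/
Allow: /
""".splitlines()

parser = RobotFileParser()
# parse() accepts the lines of a robots.txt file, so nothing is downloaded.
parser.parse(rules)

home_allowed = parser.can_fetch("*", "https://www.example.com/")
admin_allowed = parser.can_fetch("*", "https://www.example.com/wp-admin/")

print(home_allowed)   # True: only the Allow rule matches this path
print(admin_allowed)  # False: the Disallow rule matches first
```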


In [1]:
import urllib
print(dir(urllib))

['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', 'error', 'parse', 'request', 'response']


In [4]:
import urllib.request

# urllib.request.urlopen() returns a response object,
# which can be handled much like a file.
response = urllib.request.urlopen('http://stackoverflow.com')

In [6]:
print(response.code)

200


response.code holds the HTTP status code of the reply: 200 means OK, 404 means Not Found, and so on.

HTTP status codes:

    100 Continue
    101 Switching Protocols
    103 Early Hints
    200 OK
    201 Created
    202 Accepted
    203 Non-Authoritative Information
    204 No Content
    205 Reset Content
    206 Partial Content
    300 Multiple Choices
    301 Moved Permanently
    302 Found
    303 See Other
    304 Not Modified
    307 Temporary Redirect
    308 Permanent Redirect
    400 Bad Request
    401 Unauthorized
    402 Payment Required
    403 Forbidden
    404 Not Found
    405 Method Not Allowed
    406 Not Acceptable
    407 Proxy Authentication Required
    408 Request Timeout
    409 Conflict
    410 Gone
    411 Length Required
    412 Precondition Failed
    413 Payload Too Large
    414 URI Too Long
    415 Unsupported Media Type
    416 Range Not Satisfiable
    417 Expectation Failed
    418 I'm a teapot
    422 Unprocessable Entity
    425 Too Early
    426 Upgrade Required
    428 Precondition Required
    429 Too Many Requests
    431 Request Header Fields Too Large
    451 Unavailable For Legal Reasons
    500 Internal Server Error
    501 Not Implemented
    502 Bad Gateway
    503 Service Unavailable
    504 Gateway Timeout
    505 HTTP Version Not Supported
    506 Variant Also Negotiates
    507 Insufficient Storage
    508 Loop Detected
    511 Network Authentication Required
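The codes listed above do not have to be memorised: the standard library's http.HTTPStatus enum maps a numeric code to its phrase. A minimal sketch, in which the helper name describe_status is made up for illustration:

```python
from http import HTTPStatus


def describe_status(code):
    """Return 'code phrase' for a known HTTP status code (hypothetical helper)."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not know.
        return f"{code} (unknown status code)"
    return f"{status.value} {status.phrase}"


print(describe_status(200))
print(describe_status(404))
print(describe_status(299))
```

A value such as response.code can be passed to the helper directly.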

In [7]:
print(response.read())



In [9]:
print(dir(urllib.request))



HTTP POST

To send a POST request, URL-encode the form fields, convert them to bytes, and pass them to urlopen() as the data argument.

In [10]:
import urllib.parse
import urllib.request

query_parms = {'username':'stackoverflow','password':'me.me'}
# urlopen() expects the POST body as bytes
encoded_parms = urllib.parse.urlencode(query_parms).encode('utf-8')
response = urllib.request.urlopen("https://stackoverflow.com/users/login", encoded_parms)
print(response.code)
print(response.read())
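The encoding step can be inspected without sending anything. The sketch below builds a urllib.request.Request object and looks at the body and method it would use; the example.com URL and the form fields are made up for illustration, and the request is never passed to urlopen().

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical form fields; the space shows how values get encoded.
form_fields = {"q": "python urllib", "lang": "en"}

# urlencode() builds the form body as str; encode() turns it into bytes.
body = urlencode(form_fields).encode("utf-8")
print(body)

# Attaching data to a Request makes it a POST request by default.
request = Request("https://www.example.com/search", data=body)
print(request.get_method())
print(request.full_url)
```

Passing this request object to urllib.request.urlopen() would send it; that call is left out here so the example stays offline.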

