README.md

CookiePageGetter is a class that helps retrieve web pages from sites that require user logins. It uses Python's urllib and http.cookiejar modules to supply a simple interface for sending the complex HTTP requests needed to complete a login, as well as the simple requests that follow once the login is completed.

Usage

Getting a web page

The following code will retrieve the main page of google.com, and save the response cookies to a file:

# Assumes the class is defined in page_getter.py
from page_getter import CookiePageGetter

page_getter = CookiePageGetter()
data = page_getter.get_page_html("http://www.google.com")

Subsequent requests will use the cookies returned in the response.
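
For readers curious how this kind of cookie persistence is usually done with the standard library, here is a minimal sketch. It is not taken from page_getter.py; the function name build_cookie_opener and the file name cookies.txt are assumptions made for illustration.

```python
import http.cookiejar
import urllib.request


def build_cookie_opener(cookie_file):
    """Return (opener, jar); the opener stores and resends cookies through the jar."""
    jar = http.cookiejar.MozillaCookieJar(cookie_file)
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


# "cookies.txt" is a hypothetical file name.
opener, jar = build_cookie_opener("cookies.txt")

# Each opener.open(url) call would add the response cookies to `jar` and
# attach matching cookies to later requests; jar.save() writes them to disk.
```

No request is sent here; the sketch only shows how the opener and the cookie jar are wired together.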

Sending HTTP Headers

Many websites check specific HTTP headers and deny access if they are missing. To send extra headers, pass them as a dictionary in the additional_headers parameter:

data = page_getter.get_page_html("http://www.foo.com", additional_headers={"Referer": "http://www.google.com/"})

Some HTTP headers are added automatically by CookiePageGetter. See "More features" below.
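
As a rough illustration of how default and caller-supplied headers can be combined with urllib, consider the sketch below. The DEFAULT_HEADERS values and the build_request helper are assumptions for this example, not the class's actual defaults or internals.

```python
import urllib.request

# Assumed defaults, for illustration only; the real values are not shown in this README.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "Accept-Encoding": "gzip",
}


def build_request(url, additional_headers=None):
    """Merge the defaults with caller-supplied headers; the caller's values win."""
    headers = dict(DEFAULT_HEADERS)
    headers.update(additional_headers or {})
    return urllib.request.Request(url, headers=headers)


request = build_request("http://www.foo.com", {"Referer": "http://www.google.com/"})
```

Note that urllib.request.Request normalizes header names with str.capitalize(), so "Accept-Encoding" is stored as "Accept-encoding".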

Sending POST requests

Some web pages require form data to be submitted as a POST request. To send one, pass the form fields as a dictionary in the post_data parameter of get_page_html:

data = page_getter.get_page_html("http://www.foo.com/form.php", post_data={"Text": "bar"})
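
With plain urllib, a dictionary of form fields is typically URL-encoded and attached as the request body, which is what turns the request into a POST. The sketch below shows that step in isolation; build_post_request is a hypothetical helper, and whether the class does exactly this is an assumption.

```python
import urllib.parse
import urllib.request


def build_post_request(url, post_data):
    """URL-encode the form fields and attach them as the request body."""
    body = urllib.parse.urlencode(post_data).encode("utf-8")
    # Supplying `data` makes urllib send the request as a POST.
    return urllib.request.Request(url, data=body)


request = build_post_request("http://www.foo.com/form.php", {"Text": "bar"})
```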

Logging in

Together, cookies, HTTP headers and POST requests make it possible to send login requests to websites. Some websites require complex login requests in order to verify the user. To handle these, subclass CookiePageGetter and implement the log_in method with the required requests; the subclass then offers the same simple interface for retrieving data from the website after logging in.

class FooPageGetter(CookiePageGetter):
    def log_in(self, username, password):
        # Form fields expected by the (fictional) login page
        post_data = {"action": "login", "user": username, "passwd": password}
        # A Referer value should be a full URL
        additional_headers = {"Referer": "http://www.foo.com/"}
        self.get_page_html("http://www.foo.com/login.php",
                           additional_headers=additional_headers,
                           post_data=post_data)

page_getter = FooPageGetter("foo_user", "Password1")
data = page_getter.get_page_html("http://www.foo.com/user_page.php")

log_in is called automatically if a username and password are passed to the constructor of the class. The log_in method can also make multiple requests if, for example, CSRF tokens are needed or the server verifies that requests arrive in the correct order. See sample_logins.py for some real-world examples.
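
The multi-request case might look roughly like the sketch below. To keep it runnable offline it uses FakePageGetter, a stand-in for CookiePageGetter that only records requests. The csrf_token field name, the token pattern and the URLs are all hypothetical, and a real log_in would subclass CookiePageGetter instead.

```python
import re


class FakePageGetter:
    """Offline stand-in for CookiePageGetter; it only records requests."""

    def __init__(self):
        self.requests = []

    def get_page_html(self, url, additional_headers=None, post_data=None):
        self.requests.append((url, additional_headers, post_data))
        # Pretend every page contains a login form with a hidden token.
        return '<form><input type="hidden" name="csrf_token" value="abc123"></form>'


class TokenLoginGetter(FakePageGetter):
    # Hypothetical pattern; real sites name and place their tokens differently.
    TOKEN_PATTERN = re.compile(r'name="csrf_token"\s+value="([^"]+)"')

    def log_in(self, username, password):
        # Step 1: fetch the form so the server can set cookies and issue a token.
        form_html = self.get_page_html("http://www.foo.com/login.php")
        match = self.TOKEN_PATTERN.search(form_html)
        if match is None:
            raise ValueError("CSRF token not found in login form")
        # Step 2: post the credentials together with the token.
        post_data = {"user": username, "passwd": password,
                     "csrf_token": match.group(1)}
        self.get_page_html("http://www.foo.com/login.php", post_data=post_data)


getter = TokenLoginGetter()
getter.log_in("foo_user", "Password1")
```

After log_in returns, getter.requests holds two entries: the initial GET of the form and the POST that carries the extracted token.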

More features

  • Binary files can be downloaded and saved locally using the method download_binary.
  • get_page_html automatically detects the HTML encoding and returns decoded text: a string, not bytes.
  • The class automatically adds the HTTP header Accept-Encoding: gzip and decompresses gzipped data if it is returned. This helps reduce bandwidth and response times.
  • The class automatically adds a User-Agent string that mimics a popular browser. Many websites verify this string.
  • The class retries requests several times before failing, to try to recover from temporary connection glitches.
  • The class keeps track of bandwidth (bytes downloaded from the websites) and supplies easy logging of that bandwidth.
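
The gzip and encoding behavior listed above can be sketched with the standard library as follows. This is an illustration under assumptions, not the code in page_getter.py: decode_body is a hypothetical helper, and the charset lookup order (Content-Type header, then a meta tag, then UTF-8) is one common choice.

```python
import gzip
import re


def decode_body(raw, content_encoding="", content_type=""):
    """Decompress gzip data if flagged, then decode the bytes to str."""
    if content_encoding.lower() == "gzip":
        raw = gzip.decompress(raw)
    # Prefer the charset from the Content-Type header, then a <meta> tag, then UTF-8.
    match = re.search(r"charset=([\w-]+)", content_type, re.IGNORECASE)
    if match is not None:
        charset = match.group(1)
    else:
        meta = re.search(rb'<meta[^>]+charset=["\']?([\w-]+)', raw[:2048], re.IGNORECASE)
        charset = meta.group(1).decode("ascii") if meta else "utf-8"
    return raw.decode(charset, errors="replace")


# Simulate a gzip-compressed Latin-1 response body.
compressed = gzip.compress("<p>caf\u00e9</p>".encode("latin-1"))
text = decode_body(compressed, "gzip", "text/html; charset=ISO-8859-1")
```

The result is a str, matching the "string, not bytes" behavior described for get_page_html.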