# Welcome to the Dark Art of Coding:
## Introduction to Python
Gathering data from the web

<img src='../images/dark_art_logo.600px.png' width='300' style="float:right">

# Objectives
---

In this session, students should expect to:

* Use and understand the basics of the `urllib` module
* Use and understand the basics of the Beautiful Soup library

# Networks
---

## TCP

TCP (Transmission Control Protocol) is a protocol that is used to send data across a network.

* It relies on built-in mechanisms to increase reliability
* TCP creates a connection between two devices (it is a connection-oriented protocol)
* It checks that all data has arrived and, if not, requests that the missing data be resent
* TCP delivers the data in order
* Between the reliability checks and the ordering of packets, it works very well for sending files (like web pages)
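
As a rough illustration of the connection-oriented behavior described above, the following sketch opens a TCP connection between two sockets on the same machine. The `echo_once` helper, the loopback address, and the message are assumptions made for this example; they are not part of the lesson files.

```python
import socket
import threading

# Listen on the loopback interface; port 0 lets the OS pick a free port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]


def echo_once():
    """Accept one connection and send back whatever arrives, uppercased."""
    conn, _addr = listener.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data.upper())


worker = threading.Thread(target=echo_once)
worker.start()

# create_connection() sets up the TCP connection to the listener.
with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
    client.sendall(b"hello over tcp")
    reply = client.recv(1024)

worker.join()
listener.close()
print(reply)
```

Both sides must agree to the connection before any data moves, which is the main difference from connectionless protocols.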


## Port numbers

The TCP protocol incorporates the use of port numbers:

* Port numbers are used by computers to ensure that traffic coming to a given computer gets funneled to the correct application
* Multiple ports allow multiple applications on the same computer to talk without interfering with each other
* Many higher-level protocols have a default TCP port number that applications use unless told otherwise

Task | Port
:----|:----
Telnet | 23
SSH | 22
HTTP | 80
HTTPS | 443
SMTP (E-mail) | 25
DNS (Domain Name) | 53
FTP (File Transfer) | 21
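
Several of these defaults are also available as constants in the Python standard library. The short sketch below collects a few of them; the `default_ports` dictionary is just a name chosen for this example.

```python
import ftplib
import http.client
import smtplib

# Each constant is the default port the corresponding module connects to.
default_ports = {
    "FTP": ftplib.FTP_PORT,
    "SMTP": smtplib.SMTP_PORT,
    "HTTP": http.client.HTTP_PORT,
    "HTTPS": http.client.HTTPS_PORT,
}

for name, port in default_ports.items():
    print(f"{name:5} {port}")
```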

## HTTP (Hyper Text Transfer Protocol)

HTTP is a common protocol that may be sent using TCP.

* HTTP is the standard protocol for web browsers and many other applications on the internet
* It was invented to retrieve HTML, images, documents, etc.
* Basic concept:
    * Make a connection
    * Request a document
    * Retrieve the document
    * Close the connection

A typical Uniform Resource Locator (URL) has several components:

* The protocol, generally HTTP (but it could be others)
* The server (host) that holds the document
* The path to and name of the document

http://  | www.py4e.com | /lessons/network
:--------|:-------------|:----------------
Protocol | Host         | Document
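
The same breakdown can be reproduced with `urllib.parse.urlsplit` from the standard library. This is a small sketch using the example URL from the table; the second URL mirrors the local server address used later in this lesson.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.py4e.com/lessons/network")

print(parts.scheme)  # the protocol
print(parts.netloc)  # the host
print(parts.path)    # the document

# When a URL names a port explicitly, it is available as well.
local = urlsplit("http://localhost:8000/annabel_lee.txt")
print(local.hostname, local.port)
```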

# HTTP

* The browser attempts to connect to `http://www.example.com`
* It issues a request for a document such as `index.html`
* The server sends the HTML document
* The browser renders the HTML
* The connection is closed when done
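
The steps above can be sketched with the lower-level `http.client` module. To keep the example self-contained, it starts a tiny server in the same process; the `HelloHandler` class and the page it returns are hypothetical and exist only for this illustration.

```python
import http.client
import http.server
import threading


class HelloHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a tiny HTML page."""

    def do_GET(self):
        body = b"<html><body><h1>Hello</h1></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


# Port 0 asks the operating system for any free port.
server = http.server.HTTPServer(("127.0.0.1", 0), HelloHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# 1. Make a connection
conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
# 2. Request a document
conn.request("GET", "/index.html")
# 3. Retrieve the document
response = conn.getresponse()
status = response.status
html = response.read().decode()
# 4. Close the connection
conn.close()

server.shutdown()
server.server_close()

print(status)
print(html)
```

A browser adds a rendering step after retrieving the document; here the HTML is simply printed.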

# Standing up a local HTTP server
---

Python lets you stand up your own HTTP server.
This lesson is designed to work in offline environments, where you may not have access to the Internet.

In those cases, we start this lesson by standing up our own server and using Python to interact with webpages on that server. The behaviors and code will all be the same; the only thing that changes is the URL.

Do the following on the command line. It will run an HTTP server on your local computer, serving the folder where you execute the Python command:

```bash
$ cd path/to/the/lesson 11/folder/11
$ python -m http.server 8000
```

|||
|:--|:---|
|`python` | calls the Python interpreter directly|
|`-m` | tells the interpreter to run the `http.server` module as a script, which starts a basic HTTP server.|
|`8000` | is the port that your server is running on.|

Open your browser and surf to:

```bash
localhost:8000
```


You will see something that looks like this:
    
<img src='./http_server_dir_list.png' width='300' style="float:right">
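
For reference, a similar server can also be started from inside Python. The sketch below is one possible way to do it, not the lesson's method: it serves a temporary folder (a stand-in for the lesson folder) on a port chosen by the operating system, fetches one file, and shuts down again. The file name `hello.txt` and the variable names are assumptions made for the example.

```python
import functools
import http.server
import pathlib
import tempfile
import threading
import urllib.request

# Hypothetical stand-in for the lesson folder: a temporary directory
# holding one small text file.
folder = tempfile.TemporaryDirectory()
pathlib.Path(folder.name, "hello.txt").write_text("hello from the local server\n")

# SimpleHTTPRequestHandler serves files from a directory; it is the
# handler that `python -m http.server` uses by default.
handler = functools.partial(
    http.server.SimpleHTTPRequestHandler, directory=folder.name
)

# Port 0 lets the operating system pick a free port; the lesson uses 8000.
server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{port}/hello.txt"
with urllib.request.urlopen(url, timeout=5) as response:
    text = response.read().decode()

server.shutdown()
server.server_close()
folder.cleanup()

print(text)
```

Running the server in a background thread keeps the interpreter free to make requests against it.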

# HTTP requests in Python using urllib
---

In [None]:
# First we have to import the request module from urllib

import urllib.request

# urllib allows us to open web pages just like opening files.
# The following command creates an http.client.HTTPResponse object that
#     gives us access to a number of attributes and behaviors
#     related to the data retrieved

file = urllib.request.urlopen('http://localhost:8000/annabel_lee.txt')

# A common technique is to use a for loop to cycle through every
# line and print out the data one line at a time
# In this case, the data is read in as bytes

for line in file:
    # We convert each line from bytes to a string using the
    #     .decode() method, then strip the trailing newline.
    print(line.decode().strip())

In [None]:
# Much like other files we have looked at, we can 
# read and evaluate the text in web-based text files,
# for example by counting words

import pprint
import urllib.request

url = 'http://localhost:8000/annabel_lee.txt'
file = urllib.request.urlopen(url)

count = {}

for line in file:
    
    # Again, we take the line and use .decode() to convert
    #     the data to a string
    #     Then we strip the newline
    #     Then we split it on spaces
    words = line.decode().strip().split()
    
    # We cycle through the words one at a time
    for word in words:
        
        # If a key for the word already exists .get() grabs the value otherwise it automatically returns 0
        count[word] = count.get(word, 0) + 1

pprint.pprint(count)
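
The same tally can be written with `collections.Counter` from the standard library. The sketch below uses a short list of `bytes` objects as a stand-in for the lines of the HTTP response, so it runs without a server; the two lines of text are an assumption about the file's contents.

```python
from collections import Counter

# Hypothetical stand-in for the lines read from the HTTP response
lines = [
    b"It was many and many a year ago,\n",
    b"In a kingdom by the sea,\n",
]

count = Counter()

for line in lines:
    # decode bytes -> str, strip the newline, split on whitespace
    count.update(line.decode().strip().split())

# The three most frequent words and their counts
print(count.most_common(3))
```

Note that punctuation stays attached to the words (for example `'sea,'`), just as it does in the dictionary-based version.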

# Unicode and Python text

* In Python 3, all strings (`str` objects) are Unicode
* When we talk to a network we usually have to encode and decode our data (generally as `utf-8`)
* When we receive data we typically receive it as a `bytes` object, which we then convert to a string with its `.decode()` method
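
A minimal round trip shows the difference between the two types. The sample word is arbitrary and chosen only because it contains one non-ASCII character.

```python
text = "café"                      # a str: four Unicode characters
data = text.encode("utf-8")        # a bytes object, ready to send

print(type(text), len(text))
print(type(data), len(data), data)

# Decoding the bytes gives back the original string.
restored = data.decode("utf-8")
print(restored == text)
```

The accented letter takes two bytes in UTF-8, which is why the `bytes` object is one element longer than the string.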


In [None]:
# Poor man's debugging...
# This is one of the most important lines of code to any new Pythonista

print(type(line), line)

In [None]:
# Let us look at the difference between outputting:
#     a bytes object vs.
#     a string

print(line)
print(line.decode())

# Reading web pages
---

In [None]:
# Our earlier examples were fairly straightforward, since we 
#     retrieved text files. Most of the web is not 
#     straight text files, it is composed of 
#     Hyper Text Markup Language (HTML)

# We request a page using urllib.request.urlopen()

page = urllib.request.urlopen('http://localhost:8000/jabberwocky.html')

text = page.read()
print(text)

In [None]:
# page.read() in the previous cell returned one large bytes object,
#     so print() displayed it as a bytes literal.
# Here we request the page again and decode it line by line,
#     which prints readable text instead

page = urllib.request.urlopen('http://localhost:8000/jabberwocky.html')

for line in page:
    print(line.decode().strip())
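
Printing HTML is easy, but extracting something specific from it (such as the links) with only the standard library takes more work. The following sketch uses `html.parser.HTMLParser` for that purpose; the `LinkCollector` class and the sample markup are hypothetical and are not taken from the lesson's `jabberwocky.html`.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every anchor tag that is fed in."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


# Hypothetical page content, as bytes, the way urlopen() would return it
html = (
    b'<html><body><p><a href="annabel_lee.txt">Annabel Lee</a> and '
    b'<a href="https://www.example.com/">an example</a></p></body></html>'
)

collector = LinkCollector()
collector.feed(html.decode())
print(collector.links)
```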

# Beautiful soup
---

While `urllib` can retrieve data from the web, a third-party library, `Beautiful Soup`, is commonly used alongside it to parse what comes back. `Beautiful Soup`:

* Makes reading and parsing web pages a lot easier
* Allows you to extract only tags of certain types
* Lets you find tags based on their position in the tag hierarchy
* Makes getting hyperlinks a whole lot easier

## On the command line

To install Beautiful Soup, you can run this on the command line:

*`conda install beautifulsoup4`*

## In a Python file/interpreter

In [None]:
# Import the necessary modules

from bs4 import BeautifulSoup
import urllib.request

# Get the HTML text from the HTTPResponse object
# Notice the .read() method at the end of the line

htmlText = urllib.request.urlopen('http://localhost:8000/jabberwocky.html').read()

# Use bs4 to create a soup object from our HTML text
# Provide an argument to identify which type of parser to
#     use, in this case, an HTML parser

soup = BeautifulSoup(htmlText, 'html.parser')

In [None]:
# The soup object allows you to retrieve specific types of tags, in this
#     case anchor tags (identified using an 'a'). Anchor tags are used for links.
# Calling soup('a') is shorthand for soup.find_all('a')

tags = soup('a') 

In [None]:
# Let's cycle through the tags and get the 'href' attribute.
# This is the data that contains the link itself

for tag in tags:
    print(tag.get('href', None))
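
Beautiful Soup can also select tags by where they sit in the hierarchy. The sketch below parses a small inline HTML string instead of a page from the server, so it runs without a network connection; the markup is hypothetical, and the example assumes `beautifulsoup4` is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used here in place of a downloaded page
html = """
<html><body>
  <h1>Poems</h1>
  <ul>
    <li><a href="annabel_lee.txt">Annabel Lee</a></li>
    <li><a href="jabberwocky.html">Jabberwocky</a></li>
  </ul>
  <p>See also <a href="https://www.example.com/">an example site</a>.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Only tags of one type: every anchor tag in the document
links = soup.find_all("a")
hrefs = [a.get("href") for a in links]

# Tags chosen by their place in the hierarchy: anchors directly inside <li>
list_links = [a.get_text() for a in links if a.parent.name == "li"]

print(hrefs)
print(list_links)
```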

# Using documentation
---

Let's explore the documentation for a third party library.

The documentation for Beautiful Soup has a number of nice attributes that can get you started fairly quickly, so let's use the documentation to enhance our knowledge of the subject.

[Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

# Web scraping
---

## What is web scraping?

Web scraping is a technique used to retrieve data from the web OR from similar networks (intranets, etc).

* Web scrapers simulate the behavior of a browser
* They look at the data from specific site(s)
* They extract specific information you need from it
* Typically this is done over and over again across multiple sites

## Why web scrape?

* Get data from sites that don't provide mechanisms to export the data
* Collect information on sites to build a search engine database
* Monitor sites for changes
* Collect social network data
    * Who is connected to, or communicates with, whom?
    * What is being said?
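
As one small illustration of monitoring a site for changes, a scraper can store a hash of each page and compare it on the next visit. This is only a sketch of the idea: the `fingerprint` helper and the two page bodies are made up for the example, and a real monitor would fetch the bytes with `urllib.request.urlopen()`.

```python
import hashlib


def fingerprint(page_bytes):
    """Return a fixed-length summary of a page's raw bytes."""
    return hashlib.sha256(page_bytes).hexdigest()


# Hypothetical page contents from two visits to the same URL
first_visit = b"<html><body><p>Price: 10</p></body></html>"
second_visit = b"<html><body><p>Price: 12</p></body></html>"

changed = fingerprint(first_visit) != fingerprint(second_visit)
print("changed" if changed else "unchanged")
```

Storing only the hash keeps the record small, at the cost of not knowing what changed.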

In [None]:
# source:
# http://www.jabberwocky.com/carroll/jabber/jabberwocky.html