# Welcome to the Dark Art of Coding:
## Introduction to Python
Web scraping

<img src='../images/dark_art_logo.600px.png' width='300' style="float:right">

# Objectives
---

In this session, students should expect to work with:

* The `socket` module
* The `urllib` module
* Beautiful Soup

# Networks
---

## TCP

* Sends data across a reliable, ordered connection between two machines (reliable, but not encrypted)
* The receiver acknowledges the data it gets; every segment carries a checksum, and anything lost or corrupted is sent again
* Once everything has arrived intact, the data is handed, complete and in order, to the application that asked for it

## Port numbers

* Application-specific endpoints for communicating across the internet
* Multiple ports allow multiple applications on one machine to talk across the internet without interfering with each other
* Common services have well-known default port numbers

Task | Port
:----|:----
Telnet | 23
SSH | 22
HTTP | 80
HTTPS | 443
SMTP (E-mail) | 25
DNS (Domain Name) | 53
FTP (File Transfer) | 21
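
To make the idea of default ports concrete, here is a minimal sketch that works out which port a client would connect to for a given URL. The helper name `effective_port` and the `DEFAULT_PORTS` dictionary are made up for this illustration; only the two web entries from the table are included.

```python
from urllib.parse import urlsplit

# Default ports copied from the table above (only the two web schemes)
DEFAULT_PORTS = {"http": 80, "https": 443}


def effective_port(url):
    """Return the port a client would connect to for this URL (hypothetical helper)."""
    parts = urlsplit(url)
    # parts.port is None when the URL does not name a port explicitly
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS[parts.scheme]


print(effective_port("https://www.py4e.com/lessons/network"))  # 443
print(effective_port("http://localhost:8000/index.html"))      # 8000
```

A port written in the URL always wins; the default only applies when none is given.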

## HTTP (Hyper Text Transfer Protocol)

* The standard protocol of the web, and the basis for many other applications on the internet
* Invented to retrieve HTML, images, documents, etc.
* Basic concept:
    * Make a connection
    * Request a document
    * Retrieve the document
    * Close the connection

http://  | www.py4e.com | /lessons/network
:--------|:-------------|:----------------
Protocol | Host         | Document
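
The same split can be done in code. As a small sketch, the standard library's `urllib.parse.urlsplit` breaks the URL from the table into its protocol, host, and document parts:

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.py4e.com/lessons/network")

print(parts.scheme)    # http
print(parts.hostname)  # www.py4e.com
print(parts.path)      # /lessons/network
```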

# HTTP

* Connect to `www.discordapp.com`
* Request the document by sending the request line `GET /nitro HTTP/1.0`, usually followed by a `Host: www.discordapp.com` header line
* Receive the HTML document
* The browser renders the HTML
* Close the connection when done
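
The steps above can be sketched with the `socket` module. This is an illustration under assumptions, not the way a browser is actually written: the function names `build_get_request` and `fetch` are invented here, and the sketch only speaks plain HTTP on port 80 (no HTTPS).

```python
import socket


def build_get_request(host, path):
    """Build the bytes of a minimal HTTP/1.0 GET request (hypothetical helper)."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {host}",
        "",  # together with the next entry, this produces the blank
        "",  # line that marks the end of the request headers
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch(host, path, port=80):
    """Connect, send one GET request, and return the raw response bytes."""
    with socket.create_connection((host, port), timeout=10) as conn:
        conn.sendall(build_get_request(host, path))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # empty bytes: the server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)


print(build_get_request("data.pr4e.org", "/romeo.txt"))
```

Calling `fetch('data.pr4e.org', '/romeo.txt')` on a machine with network access should return the response headers followed by the document, all as one bytes object.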

# HTTP requests in python using urllib
---

In [18]:
# First we have to import the request module from urllib

import urllib.request

file = urllib.request.urlopen('http://data.pr4e.org/romeo.txt') # urllib lets us open web pages like files

# We cycle through every line and print out the data one line at a time

for line in file:
    print(line.decode().strip())

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief


In [21]:
# Text files downloaded off the web can be processed just like the local files we used before
# For example, we can count the words

file = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')

count = {}

for line in file:
    
    # We take the line and turn it into a string instead of a bytes object
    #     Then we strip the newline
    #     Then we split it on whitespace
    words = line.decode().strip().split()
    
    # We cycle through the words one at a time
    for word in words:
        
        # If the word is already a key, .get() returns its count; otherwise it returns the default, 0
        count[word] = count.get(word, 0) + 1

print(count)

{'what': 1, 'with': 1, 'moon': 1, 'through': 1, 'envious': 1, 'and': 3, 'soft': 1, 'pale': 1, 'east': 1, 'already': 1, 'breaks': 1, 'sick': 1, 'window': 1, 'sun': 2, 'fair': 1, 'Juliet': 1, 'Who': 1, 'But': 1, 'yonder': 1, 'the': 3, 'Arise': 1, 'light': 1, 'It': 1, 'is': 3, 'kill': 1, 'grief': 1}
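
As an alternative sketch, the same count can be written with `collections.Counter` from the standard library. The function name `count_words` is made up for this example, and the four lines are hard-coded as bytes objects to stand in for the downloaded file, so the code runs without a network connection.

```python
from collections import Counter


def count_words(lines):
    """Count words in an iterable of bytes lines, such as a urlopen() response."""
    counts = Counter()
    for line in lines:
        # decode bytes to str, then split on whitespace (this also drops the newline)
        counts.update(line.decode().split())
    return counts


# Stand-in for the downloaded file: the same four lines as bytes objects
sample = [
    b"But soft what light through yonder window breaks\n",
    b"It is the east and Juliet is the sun\n",
    b"Arise fair sun and kill the envious moon\n",
    b"Who is already sick and pale with grief\n",
]

counts = count_words(sample)
print(counts["and"], counts["sun"], counts["But"])  # 3 2 1
```

Because the function accepts any iterable of bytes lines, passing it the object returned by `urllib.request.urlopen()` should work the same way.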


# Unicode and python text

* All Python strings (in Python 3) are Unicode internally
* When we talk to a network we have to encode our strings to bytes before sending them and decode the bytes we get back (usually as UTF-8)
* When we receive data we usually receive it as a bytes object, which we then pass through the `.decode()` method to get a string
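
A short round trip shows the difference between the two types. The word used here is just an example chosen because it contains a non-ASCII character:

```python
text = "café"                # a str: four Unicode characters
data = text.encode("utf-8")  # a bytes object: five bytes, because é takes two

print(len(text), len(data))          # 4 5
print(data)                          # b'caf\xc3\xa9'
print(data.decode("utf-8") == text)  # True
```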


In [4]:
# One of the most useful lines of code for any new programmer: check the type of your data

print(type(line), line)

<class 'bytes'> b'Who is already sick and pale with grief\n'


In [5]:
# The difference between outputting a bytes object vs. a string

print(line)
print(line.decode())

b'Who is already sick and pale with grief\n'
Who is already sick and pale with grief



# Reading web pages
---

In [7]:
# We pull in the page using urllib.request.urlopen()

page = urllib.request.urlopen('http://dr-chuck.com/page1.htm')

for line in page:
    print(line.decode().strip())

<h1>The First Page</h1>
<p>
If you like, you can switch to the
<a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>.
</p>


# Beautiful soup
---

* Makes reading and parsing web pages a lot easier
* Allows you to grab only tags of certain types
* You can check a tag's relationships in the hierarchy (its parent, children, and siblings)
* Getting links becomes a whole lot easier

## On the command line

*`conda install beautifulsoup4`*

## In a python file/interpreter

In [24]:
# Import the necessary modules

from bs4 import BeautifulSoup
import urllib.request

# Grab the html text

htmlText = urllib.request.urlopen('http://dr-chuck.com/page1.htm').read()

# Use bs4 to create a soup object from our html text and a 
#     string argument to let bs4 know what it's looking at

soup = BeautifulSoup(htmlText, 'html.parser')

In [28]:
# Calling the soup object with a tag name returns a list of all tags of that type

tags = soup('a')  # 'a' tags are called anchor tags and they are used for links

In [29]:
# Let's cycle through the tags and get the 'href' attribute of each one. This is the data that contains the link itself

for tag in tags:
    print(tag.get('href', None))

http://www.dr-chuck.com/page2.htm
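
Here is a slightly fuller sketch of the same idea that also looks at the link text and the tag's parent. The HTML is a made-up variation of the page above with a relative link, held in a string so the example runs without a network connection; `urljoin` from the standard library turns the relative link into a full URL.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page content (note the relative link), kept in a string
html = """
<h1>The First Page</h1>
<p>
If you like, you can switch to the
<a href="page2.htm">Second Page</a>.
</p>
"""
base_url = "http://www.dr-chuck.com/page1.htm"  # where we pretend the page came from

soup = BeautifulSoup(html, "html.parser")

links = []
# href=True skips any anchor tags that have no href attribute
for tag in soup.find_all("a", href=True):
    full_url = urljoin(base_url, tag["href"])  # resolve relative links
    links.append((tag.get_text(strip=True), full_url, tag.parent.name))

print(links)
# [('Second Page', 'http://www.dr-chuck.com/page2.htm', 'p')]
```

`soup.find_all('a')` does the same job as the shorter `soup('a')` used above.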


# Web scraping
---

## What is web scraping

* Having a program visit a site the way a browser would
* Looking at the data that site sends back
* Extracting the information you need from it
* Typically this is done over and over again on multiple sites
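
A site learns what kind of client is visiting mostly from the `User-Agent` request header, and by default urllib announces itself as `Python-urllib`. As a hedged sketch, this is how a request with a custom header can be built; the User-Agent string is invented for this example, and nothing is sent over the network until the request is passed to `urlopen()`.

```python
import urllib.request

# Hypothetical User-Agent string, made up for this lesson
headers = {"User-Agent": "dark-art-of-coding-lesson/1.0"}

request = urllib.request.Request("http://dr-chuck.com/page1.htm", headers=headers)

print(request.full_url)     # http://dr-chuck.com/page1.htm
print(request.get_method()) # GET
# Request stores header names capitalized, so the key is 'User-agent'
print(request.get_header("User-agent"))  # dark-art-of-coding-lesson/1.0
```

Passing `request` to `urllib.request.urlopen()` in place of a plain URL string would send the page request with this header attached.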

## Why web scrape?

* Get data from a site that can't export its data
* Collect information on sites to build a search engine database
* Monitor sites for changes
* Collect social data (who is connected to whom?)
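
Monitoring a site for changes, for instance, can be as simple as comparing a fingerprint of the page between visits. This is a minimal sketch with made-up function names (`fingerprint`, `has_changed`) and hard-coded bytes standing in for two downloads of the same page:

```python
import hashlib


def fingerprint(content):
    """Return a SHA-256 hex digest of a page's raw bytes."""
    return hashlib.sha256(content).hexdigest()


def has_changed(old_fingerprint, new_content):
    """Report whether the new download differs from the saved fingerprint."""
    return fingerprint(new_content) != old_fingerprint


# Stand-ins for two downloads of the same page at different times
first_visit = b"<h1>The First Page</h1>"
second_visit = b"<h1>The First Page (updated)</h1>"

saved = fingerprint(first_visit)
print(has_changed(saved, first_visit))   # False
print(has_changed(saved, second_visit))  # True
```

In a real scraper the bytes would come from `urllib.request.urlopen(...).read()`, and only the saved fingerprint would need to be stored between visits.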