# Exploring Web Scraping Methods

In this project, I explore three Python packages that can be used for web scraping.

### Method 1. Using Socket

The socket module is a low-level networking interface that can be used to pull information from a website.

In [1]:
## This is a low-level way to get a file or image from a web server
import socket
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('www.py4inf.com', 80))

In [2]:
## HTTP request lines end with \r\n, and a blank line ends the request
mysock.send('GET http://www.py4inf.com/code/romeo.txt HTTP/1.0\r\n\r\n'.encode())
while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data)
mysock.close()

b'HTTP/1.1 404 Not Found\r\nServer: nginx\r\nDate: Thu, 09 Apr 2020 16:39:35 GMT\r\nContent-Type: text/html\r\nContent-Length: 146\r\nConnection: close\r\n\r\n<html>\r\n<head><title>404 Not Found</title></head>\r\n<body>\r\n<center><h1>404 Not Found</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n'


Note: encode() converts a string to a bytes object; decode() converts a bytes object back to a string.
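
As a small illustration of the note above, the sketch below round-trips a request string through encode() and decode(), then splits a response into its status line and body. The response bytes are a shortened, hand-written sample modelled on the output above, not a live capture, and the variable names are only for illustration.

```python
# Build the request as a string, then encode it to bytes for the socket.
request = 'GET http://www.py4inf.com/code/romeo.txt HTTP/1.0\r\n\r\n'
request_bytes = request.encode()          # str -> bytes (UTF-8 by default)
assert request_bytes.decode() == request  # bytes -> str gives back the original

# Hand-written sample response (an assumption, for illustration only).
sample_response = (b'HTTP/1.1 404 Not Found\r\n'
                   b'Content-Type: text/html\r\n'
                   b'\r\n'
                   b'<html><body>404 Not Found</body></html>')

# Headers and body are separated by a blank line (\r\n\r\n).
header_bytes, _, body_bytes = sample_response.partition(b'\r\n\r\n')
status_line = header_bytes.decode().split('\r\n')[0]
body = body_bytes.decode()

print(status_line)
print(body)
```

Decoding before printing avoids the b'...' prefix and the literal \r\n escapes seen in the raw socket output.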

### Method 2. Using urllib 
With urllib, I can treat a web page much like a file: I simply indicate which web page to retrieve, and urllib handles all of the HTTP protocol and header details.

In [4]:
import urllib.request

In [5]:
fhand = urllib.request.urlopen('http://www.py4inf.com/code/romeo.txt')
for line in fhand:
    print (line.strip())

b'But soft what light through yonder window breaks'
b'It is the east and Juliet is the sun'
b'Arise fair sun and kill the envious moon'
b'Who is already sick and pale with grief'


#### Parsing HTML using regular expressions
One of the common uses of the urllib capability in Python is to scrape the web.
Web scraping is when we write a program that pretends to be a web browser and
retrieves pages, then examines the data in those pages looking for patterns.

In [29]:
import re
url = 'http://www.py4inf.com'
# .decode('utf-8') converts the bytes object returned by the request to a string
html = urllib.request.urlopen(url).read().decode('utf-8')
print(html)
# Find every double-quoted href value that starts with http://
links = re.findall(r'href="(http://.*?)"', html)
for link in links:
    print(link)

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>PythonLearn - Exploring Data</title>
<link rel="stylesheet" href="glike.css" type="text/css" />
<link rel="alternate" type="application/rss+xml" title="PythonLearn Podcast" 
href="http://www.pythonlearn.com/audiofeed.xml" />
</head>
<body>
<div id="header">
<h1><a href="index.php" class="selected" accesskey="1">PythonLearn</a></h1>
<ul class="toolbar">
<li><a href="book.php" >Book</a></li>
<li><a href="install.php" >Install</a></li>
<li><a href=http://www.pr4e.org/ target="_blank">MOOC</a></li>
<li><a href="http://www.dr-chuck.com/" target="_blank">Instructor</a></li>
<li><a href="http://www.python.org/" target="_blank">Python</a></li>
<li><a href="about.php" >About</a></li>
</ul>
</div>
<div id="main">
<div style="float: right; width:300px; padding: 5px;">
<iframe width="300" height="169" src="//www.youtube.com/embed/U
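
To see what the regular expression does and does not match, the sketch below runs the same pattern on a short HTML string adapted from the toolbar in the page output above. The name sample_html is only for illustration.

```python
import re

# A few anchor tags adapted from the page output above.
sample_html = '''
<li><a href="book.php" >Book</a></li>
<li><a href=http://www.pr4e.org/ target="_blank">MOOC</a></li>
<li><a href="http://www.dr-chuck.com/" target="_blank">Instructor</a></li>
<li><a href="http://www.python.org/" target="_blank">Python</a></li>
'''

# Same pattern as above: a double-quoted href value starting with http://
links = re.findall(r'href="(http://.*?)"', sample_html)
for link in links:
    print(link)
```

Only the dr-chuck.com and python.org links are printed. The relative link book.php and the unquoted href=http://www.pr4e.org/ are both skipped, which is one reason a real HTML parser tends to be more robust than a regular expression.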

#### Obtaining binary files (pictures and video) using urllib

In [24]:
## Handle a small file
img = urllib.request.urlopen('http://www.py4inf.com/cover.jpg').read()
fhand = open('result/cover.jpg', 'wb')
fhand.write(img)
fhand.close()
## This program reads all of the data across the network at once and stores it in
## the variable img in main memory, then opens the file result/cover.jpg and writes
## the data out to disk. This works as long as the file is smaller than the
## available memory. The result/ directory must already exist.

However, if this is a large audio or video file, this program may crash or at least
run extremely slowly when your computer runs out of memory. In order to avoid
running out of memory, we retrieve the data in blocks (or buffers) and then write
each block to your disk before retrieving the next block. This way the program can
read any size file without using up all of the memory you have in your computer.

In [25]:
img = urllib.request.urlopen('http://www.py4inf.com/cover.jpg')
fhand = open('result/cover1.jpg', 'wb')
size = 0
while True:
    info = img.read(100000)  ## read up to 100000 bytes at a time
    if len(info) < 1 : break
    size = size + len(info)
    fhand.write(info)
print (size,'characters copied.')
fhand.close()

70057 characters copied.
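
The block-by-block loop above can be wrapped in a reusable function. The sketch below is one way to do that; copy_in_blocks is a hypothetical helper name, and an in-memory io.BytesIO object stands in for the network response so the example runs without a connection.

```python
import io

def copy_in_blocks(source, destination, block_size=100000):
    """Copy a binary stream block by block; return the total bytes copied."""
    total = 0
    while True:
        block = source.read(block_size)
        if len(block) < 1:   # an empty read means the stream is exhausted
            break
        destination.write(block)
        total += len(block)
    return total

# Stand-in for a network response: 250000 bytes held in memory.
fake_response = io.BytesIO(b'x' * 250000)
output = io.BytesIO()
copied = copy_in_blocks(fake_response, output)
print(copied, 'bytes copied.')
```

Only one block is held in memory at a time, so the same function works for files of any size. The standard library's shutil.copyfileobj performs a similar buffered copy.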


### Method 3. Using BeautifulSoup

BeautifulSoup is a handy tool for scraping websites.

In [26]:
import urllib.request
from bs4 import BeautifulSoup 

In [27]:
url = 'http://www.py4inf.com'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print (tag.get('href', None))

index.php
book.php
install.php
http://www.pr4e.org/
http://www.dr-chuck.com/
http://www.python.org/
about.php
http://www.py4e.com/
http://www.py4e.com/
book.php
https://www.youtube.com/watch?v=UQVK-dsU7-Y&index=2&list=PLlRFEj9H3Oj4JXIwMwN1_ss1Tk8wZShEJ
https://itunes.apple.com/us/podcast/python-for-informaticss-official/id711095516?mt=2
book.php
install.php
code.zip
pythonauto/index.php
https://drive.google.com/folderview?id=0B7X1ycQalUnyWXg2MVhTbEZFT28&usp=sharing
youtube/Py4Inf-01-Intro.php
podcasts/Py4Inf-01-Intro.mp3
videos/Py4Inf-01-Intro.mp4
youtube/Py4Inf-02-Expressions.php
podcasts/Py4Inf-02-Expressions.mp3
videos/Py4Inf-02-Expressions.mp4
http://www-personal.umich.edu/~csev/books/py4inf/exercises/Py4Inf-ex-02-02.mp4
http://www-personal.umich.edu/~csev/books/py4inf/exercises/Py4Inf-ex-02-03.mp4
youtube/Py4Inf-03-Conditional.php
podcasts/Py4Inf-03-Conditional.mp3
videos/Py4Inf-03-Conditional.mp4
http://www-personal.umich.edu/~csev/books/py4inf/exercises/Py4Inf-ex-03-01.mp4
http:
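
The list above mixes relative links (such as index.php) with absolute ones. If full URLs are needed, one option is urllib.parse.urljoin from the standard library. The sketch below applies it to a few of the href values shown above; the base URL is written with a trailing slash here, and the variable names are only for illustration.

```python
from urllib.parse import urljoin

base_url = 'http://www.py4inf.com/'
hrefs = ['index.php', 'podcasts/Py4Inf-01-Intro.mp3', 'http://www.python.org/']

# urljoin resolves relative links against the base and leaves absolute ones alone.
absolute = [urljoin(base_url, href) for href in hrefs]
for link in absolute:
    print(link)
```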

In [28]:
url = 'http://www.py4inf.com'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    # Look at the parts of a tag
    print ('TAG:',tag)
    print ('URL:',tag.get('href', None))
    print ('Content:',tag.contents[0])
    print ('Attrs:',tag.attrs)

TAG: <a accesskey="1" class="selected" href="index.php">PythonLearn</a>
URL: index.php
Content: PythonLearn
Attrs: {'href': 'index.php', 'class': ['selected'], 'accesskey': ['1']}
TAG: <a href="book.php">Book</a>
URL: book.php
Content: Book
Attrs: {'href': 'book.php'}
TAG: <a href="install.php">Install</a>
URL: install.php
Content: Install
Attrs: {'href': 'install.php'}
TAG: <a href="http://www.pr4e.org/" target="_blank">MOOC</a>
URL: http://www.pr4e.org/
Content: MOOC
Attrs: {'href': 'http://www.pr4e.org/', 'target': '_blank'}
TAG: <a href="http://www.dr-chuck.com/" target="_blank">Instructor</a>
URL: http://www.dr-chuck.com/
Content: Instructor
Attrs: {'href': 'http://www.dr-chuck.com/', 'target': '_blank'}
TAG: <a href="http://www.python.org/" target="_blank">Python</a>
URL: http://www.python.org/
Content: Python
Attrs: {'href': 'http://www.python.org/', 'target': '_blank'}
TAG: <a href="about.php">About</a>
URL: about.php
Content: About
Attrs: {'href': 'about.php'}
TAG: <a href="ht