### Using Socket

In [4]:
# Low-level approach: fetch a file or image from a web server over a raw socket
import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('www.py4inf.com', 80))


In [5]:
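# HTTP request lines end with CRLF; a blank line marks the end of the request
mysock.send('GET http://www.py4inf.com/code/romeo.txt HTTP/1.0\r\n\r\n'.encode())
while True:
    data = mysock.recv(512)  # receive up to 512 bytes at a time
    if len(data) < 1:
        break
    print(data)
mysock.close()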

b'HTTP/1.1 404 Not Found\r\nServer: nginx\r\nDate: Thu, 09 Apr 2020 04:21:03 GMT\r\nContent-Type: text/html\r\nContent-Length: 146\r\nConnection: close\r\n\r\n<html>\r\n<head><title>404 Not Found</title></head>\r\n<body>\r\n<center><h1>404 Not Found</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n'
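
The bytes received from the socket hold the status line, the headers, and the body in one stream. The following is a minimal sketch of separating them, assuming all received chunks have already been joined into one bytes object. The helper name `split_http_response` is hypothetical, and the sample data is a shortened copy of the 404 response recorded above.

```python
def split_http_response(raw: bytes):
    """Split a raw HTTP response into (status_line, headers, body)."""
    # The header section ends at the first blank line (CRLF CRLF).
    head, _, body = raw.partition(b'\r\n\r\n')
    lines = head.decode('iso-8859-1').split('\r\n')
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only; header values may contain colons.
        name, _, value = line.partition(':')
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body

# Sample data: shortened copy of the response shown above (body truncated).
raw = (b'HTTP/1.1 404 Not Found\r\nServer: nginx\r\n'
       b'Content-Type: text/html\r\nContent-Length: 146\r\n'
       b'Connection: close\r\n\r\n<html>\r\n...')
status_line, headers, body = split_http_response(raw)
print(status_line)
print(headers['content-type'])
```

This is the kind of bookkeeping that urllib performs on the caller's behalf.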


### Using urllib 
With urllib, you can treat a web page much like a file. You indicate which web page you would like to retrieve, and urllib handles the HTTP protocol and header details.

In [6]:
import urllib.request

In [9]:
fhand = urllib.request.urlopen('http://www.py4inf.com/code/romeo.txt')
for line in fhand:
    print (line.strip())

b'But soft what light through yonder window breaks'
b'It is the east and Juliet is the sun'
b'Arise fair sun and kill the envious moon'
b'Who is already sick and pale with grief'


### Parsing HTML using regular expressions
One of the common uses of the urllib capability in Python is to scrape the web.
Web scraping is when we write a program that pretends to be a web browser and
retrieves pages, then examines the data in those pages looking for patterns.
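
As a small offline illustration of the pattern-matching step, the sketch below runs a link-extracting regular expression over a hypothetical HTML string instead of a downloaded page. The `example.com` URLs are made up for the example.

```python
import re

# Hypothetical page content, used in place of a real download.
html = '''
<p>See <a href="http://example.com/page1.html">page one</a> and
<a href="http://example.com/page2.html">page two</a>.</p>
<a href="/relative/link.html">not matched: no http:// prefix</a>
'''

# .*? is non-greedy, so each match stops at the first closing quote.
links = re.findall(r'href="(http://.*?)"', html)
for link in links:
    print(link)
```

Only absolute `http://` links inside double-quoted `href` attributes are captured, which shows how narrow a regular-expression scraper can be.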

In [13]:
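import re
import urllib.request

# Type the URL without surrounding quotes; quotes become part of the string
url = input('Enter - ')
# read() returns bytes, so decode to str before applying a str pattern
html = urllib.request.urlopen(url).read().decode()
# print(html)
links = re.findall('href="(http://.*?)"', html)
for link in links:
    print(link)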

Enter - 'http://www.py4inf.com/code/romeo.txt'


URLError: <urlopen error unknown url type: 'http>
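
The URLError above arises because the quotation marks typed at the prompt are kept as part of the string, so the URL scheme is read as `'http`. One possible guard is sketched below, assuming only `http` and `https` URLs are wanted. The helper name `clean_url` is hypothetical.

```python
from urllib.parse import urlparse

def clean_url(text):
    """Strip whitespace and surrounding quotes, then check the URL scheme."""
    url = text.strip().strip('\'"')
    scheme = urlparse(url).scheme
    if scheme not in ('http', 'https'):
        raise ValueError('unsupported or missing URL scheme: %r' % scheme)
    return url

print(clean_url("'http://www.py4inf.com/code/romeo.txt'"))
```

Passing the result of `input()` through such a helper before calling `urlopen` would turn the quoted input into a usable URL.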

### Obtaining binary files (pictures and video) using urllib

In [15]:
# Small files: read the whole response into memory, then write it to disk
img = urllib.request.urlopen('http://www.py4inf.com/cover.jpg').read()
fhand = open('cover.jpg', 'wb')  # binary mode, because img is bytes
fhand.write(img)
fhand.close()
# This program reads all of the data across the network at once and stores it in the
# variable img in the main memory of your computer, then opens the file cover.jpg
# and writes the data out to your disk. This works if the size of the file is less
# than the size of the memory of your computer.
#
# Note: opening the file with 'w' and writing img.decode() fails for image data:
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

However, if this is a large audio or video file, this program may crash or at least
run extremely slowly when your computer runs out of memory. To avoid running out
of memory, we retrieve the data in blocks (or buffers) and write each block to disk
before retrieving the next one. This way the program can read a file of any size
without using up all of the memory in your computer.

In [17]:
img = urllib.request.urlopen('http://www.py4inf.com/cover.jpg')
# Binary mode is required; with 'w' the write below raises
# TypeError: write() argument must be str, not bytes
fhand = open('cover1.jpg', 'wb')
size = 0
while True:
    info = img.read(100000)  # read up to 100000 bytes at a time
    if len(info) < 1:
        break
    size = size + len(info)
    fhand.write(info)
print(size, 'bytes copied.')
fhand.close()
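
The block-by-block loop can be tried without a network connection. In the sketch below, an in-memory `io.BytesIO` object stands in for the response returned by `urlopen`, since both provide a `read(n)` method. The helper name `copy_in_blocks` and the 250,000-byte payload are assumptions made for illustration.

```python
import io

def copy_in_blocks(source, dest, block_size=100000):
    """Copy from source to dest one block at a time; return total bytes copied."""
    size = 0
    while True:
        block = source.read(block_size)
        if not block:  # an empty bytes object signals end of data
            break
        dest.write(block)
        size += len(block)
    return size

# Stand-in for a network response: 250,000 bytes held in memory.
source = io.BytesIO(b'\xff' * 250000)
dest = io.BytesIO()
copied = copy_in_blocks(source, dest)
print(copied, 'bytes copied.')
```

At most one block is held in the `block` variable at a time, so the helper's own memory use depends on `block_size` and not on the size of the file.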

### Using BeautifulSoup

In [18]:
import urllib.request
from bs4 import BeautifulSoup

In [19]:
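# Type the URL without surrounding quotes
url = input('Enter - ')
html = urllib.request.urlopen(url).read()
# Name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(html, 'html.parser')
# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))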

Enter - 'http://www.py4inf.com/code/romeo.txt'


URLError: <urlopen error unknown url type: 'http>

In [20]:
# Type the URL without surrounding quotes
url = input('Enter - ')
html = urllib.request.urlopen(url).read()
# Name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(html, 'html.parser')
# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    # Look at the parts of a tag
    print('TAG:', tag)
    print('URL:', tag.get('href', None))
    print('Content:', tag.contents[0])
    print('Attrs:', tag.attrs)

Enter - 'http://www.py4inf.com/code/romeo.txt'


URLError: <urlopen error unknown url type: 'http>
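
Since none of the runs above reached the parsing step, here is an offline sketch of collecting anchor tags from a hypothetical HTML string. For comparison, it uses the standard library's `html.parser.HTMLParser` and not BeautifulSoup, so it runs without a network connection or third-party packages. The class name `AnchorCollector` and the sample HTML are made up for the example.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect the href attribute and text content of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []          # list of (href, text) pairs
        self._in_anchor = False  # True while inside an open <a> tag
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._in_anchor = True
            self._href = dict(attrs).get('href')  # None if no href attribute
            self._text = []

    def handle_data(self, data):
        if self._in_anchor:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._in_anchor:
            self.links.append((self._href, ''.join(self._text)))
            self._in_anchor = False

# Hypothetical page content, used in place of a real download.
html = ('<p><a href="http://example.com/a.html">First</a> and '
        '<a name="top">no href</a></p>')
parser = AnchorCollector()
parser.feed(html)
parser.close()
for href, text in parser.links:
    print('URL:', href, 'Content:', text)
```

An anchor without an `href` attribute yields `None` here, which matches what `tag.get('href', None)` returns in the BeautifulSoup cells above.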