# Week 3
# Using urllib in Python

Since HTTP is so common, Python has a library, urllib, that does all the socket work for us and makes web pages look like files.

-- When we talk to an external resource like a network socket, we send bytes, so we need to encode Python 3 strings into a given character encoding. <br/>
-- When we read data from an external resource, we must decode it based on its character set so that it is properly represented in Python 3 as a string.


In [128]:
import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
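# Note: HTTP expects '\r\n' line endings; the bare '\n\n' here is the likely cause of the 400 response below
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\n\n'.encode()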
mysock.send(cmd)

while True:
    data = mysock.recv(512)
    if (len(data) < 1):
        break
    print(data.decode())
mysock.close()

HTTP/1.1 400 Bad Request
Date: Tue, 21 Apr 2020 17:18:01 GMT
Server: Apache/2.4.18 (Ubuntu)
Content-Length: 308
Connection: close
Content-Type: text/html; charset=iso-8859-1

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>400 Bad Request</title>
</head><body>
<h1>Bad Request</h1>
<p>Your browser sent a request that this server could not understand.<br />
</p>
<hr>
<address>Apache/2.4.18 (Ubuntu) Server at do1.dr-chuck.com Port 80</address>
</body></html>



In [9]:
import urllib.request, urllib.parse, urllib.error


urllib.request.urlopen returns a handle that behaves much like a file handle, so we can run a for loop over it.
The loop opens the URL, reads the data, and iterates once through each line.
Each line produced by this iteration is a bytes object, not a string, so it must be decoded before it is used as text.
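
A small offline sketch of the same idea. It uses `io.BytesIO` as a stand-in for the network handle (an assumption made so the example runs without a connection); iterating over it also yields bytes lines that must be decoded.

```python
import io

# Stand-in for the handle returned by urlopen(); it also yields bytes lines
fhand = io.BytesIO(b'But soft what light\nIt is the east\n')

lines = []
for line in fhand:
    # Each line is bytes (including the trailing newline) until decoded
    lines.append(line.decode().strip())

print(lines)
```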

In [10]:
fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in fhand:
    print(line.decode().strip())
    

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief


In [11]:
import urllib.request, urllib.parse, urllib.error

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')

counts = dict()
for line in fhand:
    words = line.decode().split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1
print(counts)



{'But': 1, 'soft': 1, 'what': 1, 'light': 1, 'through': 1, 'yonder': 1, 'window': 1, 'breaks': 1, 'It': 1, 'is': 3, 'the': 3, 'east': 1, 'and': 3, 'Juliet': 1, 'sun': 2, 'Arise': 1, 'fair': 1, 'kill': 1, 'envious': 1, 'moon': 1, 'Who': 1, 'already': 1, 'sick': 1, 'pale': 1, 'with': 1, 'grief': 1}


In [122]:
# alternative way of writing the previous code
import urllib.request
import itertools
import pandas as pd

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
s = []
for line in fhand:
    s.append(line.decode().strip().split())

# Flatten the list of per-line word lists, then count each word
s = list(itertools.chain.from_iterable(s))
pd.Series(s).value_counts()

and        3
is         3
the        3
sun        2
kill       1
It         1
window     1
soft       1
through    1
light      1
envious    1
already    1
sick       1
Arise      1
with       1
Who        1
grief      1
east       1
But        1
breaks     1
what       1
fair       1
Juliet     1
pale       1
moon       1
yonder     1
dtype: int64

# Homework 3

Exploring the Hypertext Transfer Protocol <br/>

You are to retrieve the following document using the HTTP protocol in a way that you can examine the HTTP Response headers.

http://data.pr4e.org/intro-short.txt  <br/>

There are three ways that you might retrieve this web page and look at the response headers:

-- Preferred: Modify the socket1.py program to retrieve the above URL and print out the headers and data. Make sure to change the code to retrieve the above URL - the values are different for each URL. <br/>
-- Open the URL in a web browser with a developer console or FireBug and manually examine the headers that are returned. <br/>
-- Use the telnet program as shown in lecture to retrieve the headers and content. <br/>

In [127]:
import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)

while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode(),end='')

mysock.close()


HTTP/1.1 200 OK
Date: Tue, 21 Apr 2020 17:17:56 GMT
Server: Apache/2.4.18 (Ubuntu)
Last-Modified: Sat, 13 May 2017 11:22:22 GMT
ETag: "a7-54f6609245537"
Accept-Ranges: bytes
Content-Length: 167
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: text/plain

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
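
Since the assignment is about examining the response headers, here is a hedged sketch of separating the headers from the body once the full response has been collected. The `raw` bytes are a shortened, hand-written stand-in for what the socket loop receives, and `split_response` is a hypothetical helper name, not part of the course code.

```python
def split_response(raw):
    """Split a raw HTTP response into (status line, header dict, body text)."""
    # Headers and body are separated by a blank line (CRLF CRLF)
    head, _, body = raw.partition(b'\r\n\r\n')
    lines = head.decode().split('\r\n')
    status = lines[0]
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, since values may contain colons
        name, _, value = line.partition(':')
        headers[name.strip()] = value.strip()
    return status, headers, body.decode()

# Hand-written sample standing in for the bytes read from the socket
raw = (b'HTTP/1.1 200 OK\r\n'
       b'Content-Length: 167\r\n'
       b'Content-Type: text/plain\r\n'
       b'\r\n'
       b'But soft what light through yonder window breaks\n')

status, headers, body = split_response(raw)
print(status)
print(headers['Content-Type'])
```

In the socket loop, the received chunks would first need to be joined into one bytes object (for example with `b''.join(...)`) before being split this way.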


In [125]:
fhand = urllib.request.urlopen('http://www.dr-chuck.com/page1.htm')
for line in fhand:
    print(line.decode().strip())
    

<h1>The First Page</h1>
<p>
If you like, you can switch to the
<a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>.
</p>


Next we're going to talk about how to take HTML apart more efficiently, look for various things, and print them out. It turns out that HTML is so ugly and so inconsistent that things like regular expressions are a poor fit for the job.

# Week 4 
# Parsing Web Pages

In [14]:
# To run this, you can install BeautifulSoup
# https://pypi.python.org/pypi/beautifulsoup4

# Or download the file
# http://www.py4e.com/code3/bs4.zip
# and unzip it in the same directory as this file

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup


In [17]:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup

url = input('Enter -')
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))


Enter -http://www.dr-chuck.com/page2.htm
page1.htm


The code above uses the BeautifulSoup library to retrieve and parse HTML and pull out the anchor tags, which is really the beginning of a web crawler.
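
A crawler needs absolute URLs, but the link printed above (`page1.htm`) is relative. One way to resolve it, sketched here with the standard library's `urllib.parse.urljoin`:

```python
from urllib.parse import urljoin

page_url = 'http://www.dr-chuck.com/page2.htm'   # page the link was found on
href = 'page1.htm'                               # relative link from the anchor tag

# urljoin resolves the relative href against the page's own URL
absolute = urljoin(page_url, href)
print(absolute)
```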

In [18]:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')


Enter - http://www.dr-chuck.com/page2.htm


In [19]:
# To run this, download the BeautifulSoup zip file
# http://www.py4e.com/code3/bs4.zip
# and unzip it in the same directory as this file

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))
    

Enter - http://www.dr-chuck.com/page2.htm
page1.htm


In [20]:
x = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in x:
    print(line.decode().strip())


But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief


In [21]:
import re 
ctx = '<p>Please click <a href="http://www.dr-chuck.com">here</a></p>'
re.findall('http://.*', ctx)


['http://www.dr-chuck.com">here</a></p>']
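
The match above runs to the end of the string because `.*` is greedy. Below is a hedged sketch of a tighter pattern that stops at the closing quote. It works on this one string, but it is still the kind of fragile approach that motivates using an HTML parser instead.

```python
import re

html = '<p>Please click <a href="http://www.dr-chuck.com">here</a></p>'

# Capture only what sits between href=" and the next double quote
links = re.findall(r'href="(http://[^"]*)"', html)
print(links)
```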

In [23]:
# To run this, download the BeautifulSoup zip file
# http://www.py4e.com/code3/bs4.zip
# and unzip it in the same directory as this file

from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = 'http://www.dr-chuck.com/page2.htm'
html = urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    # Look at the parts of a tag
    print('TAG:', tag)
    print('URL:', tag.get('href', None))
    print('Contents:', tag.contents[0])
    print('Attrs:', tag.attrs)
    

TAG: <a href="page1.htm">
First Page</a>
URL: page1.htm
Contents: 
First Page
Attrs: {'href': 'page1.htm'}


# Week 4 Homework 1

-- Scraping Numbers from HTML using BeautifulSoup: In this assignment you will write a Python program similar to http://www.py4e.com/code3/urllink2.py. The program will use urllib to read the HTML from the data files below, parse the data, extract the numbers, and compute their sum.

-- We provide two files for this assignment. One is a sample file where we give you the sum for your testing and the other is the actual data you need to process for the assignment.

Sample data: http://py4e-data.dr-chuck.net/comments_42.html (Sum=2553) <br/>
Actual data: http://py4e-data.dr-chuck.net/comments_448547.html (Sum ends with 1)

In [24]:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

def words_sum(url):
    
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE


    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, "html.parser")
    tags = soup('span')

    Sum = 0
    for tag in tags:
        Sum = Sum + int(tag.contents[0])
    
    return Sum

url = 'http://py4e-data.dr-chuck.net/comments_448547.html'
words_sum(url)


2501

# Week 4 Homework 2

Following Links in Python

In this assignment you will write a Python program that expands on http://www.py4e.com/code3/urllinks.py. The program will use urllib to read the HTML from the data files below, extract the href= values from the anchor tags, scan for a tag that is in a particular position relative to the first name in the list, follow that link, repeat the process a number of times, and report the last name you find.

We provide two files for this assignment. One is a sample file where we give you the name for your testing and the other is the actual data you need to process for the assignment.

Sample problem: Start at http://py4e-data.dr-chuck.net/known_by_Fikret.html <br/>
Find the link at position 3 (the first name is 1). Follow that link. Repeat this process 4 times. The answer is the last name that you retrieve.<br/>
Sequence of names: Fikret Montgomery Mhairade Butchi Anayah <br/>
Last name in sequence: Anayah <br/>

Actual problem: Start at: http://py4e-data.dr-chuck.net/known_by_Rhianne.html <br/>
Find the link at position 18 (the first name is 1). Follow that link. Repeat this process 7 times. The answer is the last name that you retrieve.<br/>
Hint: The first character of the name of the last page that you will load is: A 

-- Strategy <br/>
The web pages tweak the height between the links and hide the page after a few seconds to make it difficult for you to do the assignment without writing a Python program. But frankly, with a little effort and patience you can overcome these attempts to make it a little harder to complete the assignment without writing a Python program. But that is not the point. The point is to write a clever Python program to solve the problem.



In [25]:
# To run this, download the BeautifulSoup zip file
# http://www.py4e.com/code3/bs4.zip
# and unzip it in the same directory as this file

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

def problem_search(url, count, position):
    
    # count = 5
    
    def find_name(url,position):      

        html = urllib.request.urlopen(url, context=ctx).read()
        soup = BeautifulSoup(html, 'html.parser')

        # Retrieve all of the anchor tags
        tags = soup('a')
        url_new = tags[position-1].get('href', None)

        return url_new

    def name(count):
        
        n = 1
        temp = url 

        while n < count+1:
            temp = find_name(temp,position)            
            n += 1
                
        return temp[:-5].split('_')[-1]
    
    output = name(count)
    
    return output


In [26]:
url = 'http://py4e-data.dr-chuck.net/known_by_Fikret.html'
count = 4
position = 3 

problem_search(url,count,position)


'Anayah'

In [27]:
url = 'http://py4e-data.dr-chuck.net/known_by_Rhianne.html'
count = 7
position = 18 

problem_search(url,count,position)


'Ammer'

# Week 5
# Extensible Markup Language (XML)

In [28]:
import re  # needed for re.findall below
import xml.etree.ElementTree as ET

data = '''
<person>
  <name>Chuck</name>
  <phone type="intl">
    +1 734 303 4456
  </phone>
  <email hide="yes" />
</person>'''

tree = ET.fromstring(data)
print('Name:', tree.find('name').text)
print('Attr:', tree.find('email').get('hide'))
print('Phone number:',('+' + ''.join(re.findall('[0-9]+',tree.find('phone').text))))


Name: Chuck
Attr: yes
Phone number: +17343034456


In [29]:
import xml.etree.ElementTree as ET

xml_data = '''
<stuff>
  <users>
    <user x="2">
      <id>001</id>
      <name>Chuck</name>
    </user>
    <user x="7">
      <id>009</id>
      <name>Brent</name>
    </user>
  </users>
</stuff>'''

# Named xml_data rather than input so the built-in input() is not shadowed
stuff = ET.fromstring(xml_data)  # stuff is the root element of the parsed tree

lst = stuff.findall('users/user')
print('User count:', len(lst))

for item in lst:
    print('Name', item.find('name').text)
    print('Id', item.find('id').text)
    print('Attribute', item.get('x'))


User count: 2
Name Chuck
Id 001
Attribute 2
Name Brent
Id 009
Attribute 7


In [30]:
lst[0].find('name').text, lst[1].get('x')


('Chuck', '7')

# Homework 5

-- Extracting Data from XML

-- In this assignment you will write a Python program somewhat similar to http://www.py4e.com/code3/geoxml.py. The program will prompt for a URL, read the XML data from that URL using urllib, parse and extract the comment counts from the XML data, and compute the sum of the numbers in the file.

-- We provide two files for this assignment. One is a sample file where we give you the sum for your testing and the other is the actual data you need to process for the assignment.

Sample data: http://py4e-data.dr-chuck.net/comments_42.xml (Sum=2553) <br/>
Actual data: http://py4e-data.dr-chuck.net/comments_448549.xml (Sum ends with 25)

In [31]:
import urllib.request as ur
import xml.etree.ElementTree as et

url = 'http://py4e-data.dr-chuck.net/comments_448549.xml'
#print('Retrieving', url)

def calc_Sum(url):
    
    xml = ur.urlopen(url).read()
    #print('Retrieved', len(xml), 'characters')

    tree = et.fromstring(xml)

    # Find every <count> element once, at any depth, and add up the values
    counts = tree.findall('.//count')
    Sum = 0
    for count in counts:
        Sum = Sum + int(count.text)

    return Sum

calc_Sum(url)


2725
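
The `.//count` path used above matches `count` elements at any depth below the root. A minimal offline sketch follows; the XML snippet is made up for illustration and only loosely shaped like the assignment data.

```python
import xml.etree.ElementTree as ET

# Made-up sample; the element names are assumptions for this sketch
sample = '''
<commentinfo>
  <comments>
    <comment><name>Ana</name><count>97</count></comment>
    <comment><name>Ben</name><count>3</count></comment>
  </comments>
</commentinfo>'''

tree = ET.fromstring(sample)

# './/count' finds the nested count elements without spelling out the full path
total = sum(int(node.text) for node in tree.findall('.//count'))
print(total)
```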

# Week 6
# JavaScript Object Notation (JSON)
-- The code is simpler than the XML version, because json.loads returns ordinary Python dictionaries and lists
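
To make the comparison with XML concrete, here is a small sketch that reads the same made-up record from both formats. The JSON result is already a Python dictionary, while the XML tree has to be queried element by element.

```python
import json
import xml.etree.ElementTree as ET

xml_text = '<person><name>Chuck</name><email hide="yes" /></person>'
json_text = '{"name": "Chuck", "email": {"hide": "yes"}}'

# XML: parse into a tree, then search it
tree = ET.fromstring(xml_text)
xml_name = tree.find('name').text
xml_hide = tree.find('email').get('hide')

# JSON: parse straight into dicts and index them
info = json.loads(json_text)
json_name = info['name']
json_hide = info['email']['hide']

print(xml_name == json_name, xml_hide == json_hide)
```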


In [32]:
import json
data = '''{
  "name" : "Chuck",
  "phone" : {
    "type" : "intl",
    "number" : "+1 734 303 4456"
   },
   "email" : {
     "hide" : "yes"
   }
}'''

info = json.loads(data)
print('Name:',info["name"])
print('Hide:',info["email"]["hide"])


Name: Chuck
Hide: yes


In [33]:
import json
# Named json_data rather than input so the built-in input() is not shadowed
json_data = '''[
  { "id" : "001",
    "x" : "2",
    "name" : "Chuck"
  } ,
  { "id" : "009",
    "x" : "7",
    "name" : "Chuck"
  }
]'''

info = json.loads(json_data)
print('User count:', len(info))
for item in info:
    print('Name', item['name'])
    print('Id', item['id'])
    print('Attribute', item['x'])

User count: 2
Name Chuck
Id 001
Attribute 2
Name Chuck
Id 009
Attribute 7


In [34]:
info

[{'id': '001', 'x': '2', 'name': 'Chuck'},
 {'id': '009', 'x': '7', 'name': 'Chuck'}]

# Google API

-- This is what the JSON format of a Google geocoding response looks like

In [129]:
{
    "status": "OK",
     "results": [
        {
            "geometry": {
                "location_type": "APPROXIMATE",
                 "location": {
                    "lat": 42.2808256,
                     "lng": -83.7430378
                }
            },
            "address_components": [
                {
                    "long_name": "Ann Arbor",
                     "types": [
                        "locality",
                         "political"
                    ],
                    "short_name": "Ann Arbor"
                }
             ],
             "formatted_address": "Ann Arbor, MI, USA",
             "types": [
                "locality",
                "political"
            ]
        }
    ]
}

{'status': 'OK',
 'results': [{'geometry': {'location_type': 'APPROXIMATE',
    'location': {'lat': 42.2808256, 'lng': -83.7430378}},
   'address_components': [{'long_name': 'Ann Arbor',
     'types': ['locality', 'political'],
     'short_name': 'Ann Arbor'}],
   'formatted_address': 'Ann Arbor, MI, USA',
   'types': ['locality', 'political']}]}
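
A short sketch of walking this structure in Python. The `sample` dictionary is a trimmed copy of the response above, and `first_location` is a hypothetical helper that returns `None` instead of raising an error when the status is not OK or `results` is empty.

```python
def first_location(js):
    """Return (lat, lng) of the first result, or None if there is none."""
    results = js.get('results') or []
    if js.get('status') != 'OK' or not results:
        return None
    loc = results[0]['geometry']['location']
    return loc['lat'], loc['lng']

sample = {
    'status': 'OK',
    'results': [{'geometry': {'location': {'lat': 42.2808256,
                                           'lng': -83.7430378}},
                 'formatted_address': 'Ann Arbor, MI, USA'}],
}

print(first_location(sample))
print(first_location({'status': 'REQUEST_DENIED', 'results': []}))
```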

-- Google API version that is not guaranteed to run (Google now requires an API key)


In [13]:
import urllib.request, urllib.parse, urllib.error
import json

serviceurl = 'http://maps.googleapis.com/maps/api/geocode/json?'

while True:
    address = input('Enter location: ')
    if len(address) < 1: break

    url = serviceurl + urllib.parse.urlencode({'address': address})

    print('Retrieving', url)
    uh = urllib.request.urlopen(url)
    data = uh.read().decode()
    print('Retrieved', len(data), 'characters')

    try:
        js = json.loads(data)
    except json.JSONDecodeError:
        js = None

    if not js or 'status' not in js or js['status'] != 'OK':
        print('==== Failure To Retrieve ====')
        print(data)      
        break
        

    lat = js["results"][0]["geometry"]["location"]["lat"]
    lng = js["results"][0]["geometry"]["location"]["lng"]
    print('lat', lat, 'lng', lng)
    location = js['results'][0]['formatted_address']
    print(location)
    break

    
    
    
    

########################
# input:Ann Arbor, MI

Enter location: Ann Arbor, MI
Retrieving http://maps.googleapis.com/maps/api/geocode/json?address=Ann+Arbor%2C+MI
Retrieved 237 characters
==== Failure To Retrieve ====
{
   "error_message" : "You must use an API key to authenticate each request to Google Maps Platform APIs. For additional information, please refer to http://g.co/dev/maps-no-account",
   "results" : [],
   "status" : "REQUEST_DENIED"
}
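
The query string in the `Retrieving` line comes from `urllib.parse.urlencode`, which escapes spaces and punctuation so the address is safe to put in a URL. A minimal sketch:

```python
from urllib.parse import urlencode

# Spaces become '+', and the comma is percent-encoded as %2C
query = urlencode({'address': 'Ann Arbor, MI'})
print(query)

# Extra parameters are joined with '&' in insertion order
query_with_key = urlencode({'address': 'Ann Arbor, MI', 'key': 42})
print(query_with_key)
```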



-- Google API version that is guaranteed to run, up to the daily limit of 2,500 requests

In [120]:
import urllib.request, urllib.parse, urllib.error
import json
import ssl

api_key = False
# If you have a Google Places API key, enter it here
# api_key = 'AIzaSy___IDByT70'
# https://developers.google.com/maps/documentation/geocoding/intro

if api_key is False:
    api_key = 42
    serviceurl = 'http://py4e-data.dr-chuck.net/json?'
else :
    serviceurl = 'https://maps.googleapis.com/maps/api/geocode/json?'

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

while True:
    address = input('Enter location: ')
    if len(address) < 1: break

    parms = dict()
    parms['address'] = address
    if api_key is not False: parms['key'] = api_key
    url = serviceurl + urllib.parse.urlencode(parms)

    print('Retrieving', url)
    uh = urllib.request.urlopen(url, context=ctx)
    data = uh.read().decode()
    print('Retrieved', len(data), 'characters')

    try:
        js = json.loads(data)
    except json.JSONDecodeError:
        js = None

    if not js or 'status' not in js or js['status'] != 'OK':
        print('==== Failure To Retrieve ====')
        print(data)
        continue

    print(json.dumps(js, indent=4))

    lat = js['results'][0]['geometry']['location']['lat']
    lng = js['results'][0]['geometry']['location']['lng']
    print('lat', lat, 'lng', lng)
    location = js['results'][0]['formatted_address']
    print(location)
    break
    
    

########################
# input:Ann Arbor, MI

Enter location: Ann Arbor, MI
Retrieving http://py4e-data.dr-chuck.net/json?address=Ann+Arbor%2C+MI&key=42
Retrieved 1736 characters
{
    "results": [
        {
            "address_components": [
                {
                    "long_name": "Ann Arbor",
                    "short_name": "Ann Arbor",
                    "types": [
                        "locality",
                        "political"
                    ]
                },
                {
                    "long_name": "Washtenaw County",
                    "short_name": "Washtenaw County",
                    "types": [
                        "administrative_area_level_2",
                        "political"
                    ]
                },
                {
                    "long_name": "Michigan",
                    "short_name": "MI",
                    "types": [
                        "administrative_area_level_1",
                        "political"
                    ]
            

# Homework 6 Part 1 
-- Extracting Data from JSON

In this assignment you will write a Python program somewhat similar to http://www.py4e.com/code3/json2.py. The program will prompt for a URL, read the JSON data from that URL using urllib, parse and extract the comment counts from the JSON data, and compute the sum of the numbers in the file.

We provide two files for this assignment. One is a sample file where we give you the sum for your testing and the other is the actual data you need to process for the assignment.

Sample data: http://py4e-data.dr-chuck.net/comments_42.json (Sum=2553) <br/>
Actual data: http://py4e-data.dr-chuck.net/comments_448550.json (Sum ends with 39)

In [72]:
import urllib.request, urllib.parse, urllib.error
import json

def get_json(url):

    uh = urllib.request.urlopen(url)
    data = uh.read().decode()
    print('Retrieved', len(data), 'characters')

    js = json.loads(data)
    Sum = 0

    # Add up the 'count' field of every comment
    for comment in js['comments']:
        Sum += int(comment['count'])

    return Sum



In [75]:
# this is a sample question
url = 'http://py4e-data.dr-chuck.net/comments_42.json'
print(get_json(url))

# this is a test question
url = 'http://py4e-data.dr-chuck.net/comments_448550.json'
print(get_json(url))

Retrieved 2711 characters
2553
Retrieved 2734 characters
2839


# Homework 6 Part 2
-- In this assignment you will write a Python program somewhat similar to http://www.py4e.com/code3/geojson.py. The program will prompt for a location, contact a web service, retrieve JSON from the web service, parse that data, and retrieve the first place_id from the JSON. A place ID is a textual identifier that uniquely identifies a place within Google Maps.

-- You can test to see if your program is working with a location of "South Federal University" which will have a place_id of "ChIJ9e_QQm0sDogRhUPatldEFxw".

In [117]:
import urllib.request, urllib.parse, urllib.error
import json
import ssl

def find_KeyId(place):
    
    api_key = False
    # If you have a Google Places API key, enter it here
    # api_key = 'AIzaSy___IDByT70'
    # https://developers.google.com/maps/documentation/geocoding/intro

    if api_key is False:
        api_key = 42
        serviceurl = 'http://py4e-data.dr-chuck.net/json?'
    else :
        serviceurl = 'https://maps.googleapis.com/maps/api/geocode/json?'

    # Ignore SSL certificate errors
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    
    address = place
    if len(address) < 1:
        # Stop here: there is nothing to look up for an empty location
        print('error')
        return None

    parms = dict()
    parms['address'] = address
    
    if api_key is not False: 
        parms['key'] = api_key
    
    url = serviceurl + urllib.parse.urlencode(parms)
    print('Retrieving', url)
        
    uh = urllib.request.urlopen(url, context=ctx)
    data = uh.read().decode()
    print('Retrieved', len(data), 'characters')

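    try:
        js = json.loads(data)
    except json.JSONDecodeError:
        js = None

    # Guard against an unparsable or empty response before indexing into it
    if not js or not js.get('results'):
        print('==== Failure To Retrieve ====')
        return None

    place_id = js['results'][0]['place_id']
    print('The Key ID of the given input ' + place + " is " + str(place_id))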

    


In [119]:
# this is a sample question
place = 'South Federal University'
find_KeyId(place)

# this is a test question
place = 'K-State'
find_KeyId(place)

Retrieving http://py4e-data.dr-chuck.net/json?address=South+Federal+University&key=42
Retrieved 2291 characters
The Key ID of the given input South Federal University is ChIJ9e_QQm0sDogRhUPatldEFxw
Retrieving http://py4e-data.dr-chuck.net/json?address=K-State&key=42
Retrieved 1807 characters
The Key ID of the given input K-State is ChIJSXQyV43NvYcRdRt537z5Zg0
