# Week 1 - Chapter 11.1 - Regular Expressions
## Understanding Regular Expressions

Refer to [Regular Expression Quick Guide](https://www.py4e.com/html3/11-regex)

`^` Matches the beginning of the line.

`$` Matches the end of the line.

`.` Matches any character (a wildcard).

`\s` Matches a whitespace character.

`\S` Matches a non-whitespace character (opposite of \s).

`*` Applies to the immediately preceding character(s) and indicates to match zero or more times.

`*?` Applies to the immediately preceding character(s) and indicates to match zero or more times in “non-greedy mode”.

`+` Applies to the immediately preceding character(s) and indicates to match one or more times.

`+?` Applies to the immediately preceding character(s) and indicates to match one or more times in “non-greedy mode”.

`?` Applies to the immediately preceding character(s) and indicates to match zero or one time.

`??` Applies to the immediately preceding character(s) and indicates to match zero or one time in “non-greedy mode”.

`[aeiou]` Matches a single character as long as that character is in the specified set. In this example, it would match “a”, “e”, “i”, “o”, or “u”, but no other characters.

`[a-z0-9]` You can specify ranges of characters using the minus sign. This example is a single character that must be a lowercase letter or a digit.

`[^A-Za-z]` When the first character in the set notation is a caret, it inverts the logic. This example matches a single character that is anything other than an uppercase or lowercase letter.

`( )` When parentheses are added to a regular expression, they are ignored for the purpose of matching, but allow you to extract a particular subset of the matched string rather than the whole string when using findall().

`\b` Matches the empty string, but only at the start or end of a word.

`\B` Matches the empty string, but not at the start or end of a word.

`\d` Matches any decimal digit; equivalent to the set [0-9].

`\D` Matches any non-digit character; equivalent to the set [^0-9].
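
A few of the entries above can be tried directly in Python. The following is a minimal sketch; the sample string is made up purely for illustration.

```python
import re

text = 'cat concat 42 cats, Route 66'

# \b anchors the match at a word boundary, so the 'cat' inside
# 'concat' and the 'cat' at the start of 'cats' are both skipped
whole_words = re.findall(r'\bcat\b', text)

# \d+ matches runs of one or more decimal digits
numbers = re.findall(r'\d+', text)

# [^A-Za-z ] matches any single character that is neither a letter nor a space
others = re.findall(r'[^A-Za-z ]', text)

print(whole_words)
print(numbers)
print(others)
```

The patterns are written as raw strings (`r'...'`) so that Python passes the backslashes through to the regular expression engine unchanged.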

### Using `re.search()` like `find()`

In [1]:
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if line.find('From:') >= 0:
        print(line)

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu


In [2]:
import re

hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('From:', line):
        print(line)

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu


### Wild-Card Characters
- The dot character matches any character
- If you add the asterisk character, the preceding character may repeat any number of times (zero or more), so `.*` matches any run of characters

Use `^X.*:`

For this example:

```
X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent
X-DSPAM-Confidence: 0.8475
X-Content-Type-Message-Body: text/plain
```

## Fine-Tuning Your Match

Use `^X-\S+:`

For this example:

```
X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent
X-Plane is behind schedule: two weeks
```
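
The difference between the two patterns can be checked against the sample lines above. This is a small sketch; the list name `lines` and the variable names are chosen only for this example.

```python
import re

lines = [
    'X-Sieve: CMU Sieve 2.3',
    'X-DSPAM-Result: Innocent',
    'X-Plane is behind schedule: two weeks',
]

# '^X.*:' accepts anything between the X and a colon, including spaces
loose = [line for line in lines if re.search(r'^X.*:', line)]

# '^X-\S+:' requires non-whitespace characters only between 'X-' and the colon
strict = [line for line in lines if re.search(r'^X-\S+:', line)]

print(len(loose), 'lines match the loose pattern')
print(len(strict), 'lines match the strict pattern')
```

The "X-Plane" line is accepted by the loose pattern but rejected by the strict one, because a space appears before its first colon.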

### Matching and Extracting Data
- `re.search()` returns a match object if the string matches the regular expression and `None` otherwise, so it can be used as a True/False test in an `if` statement
- If we want to extract the matching strings, use `re.findall()`
- `[0-9]+` means one or more digits
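
As a quick illustration of what `re.search()` hands back, here is a minimal sketch using one of the `From:` lines from earlier in these notes:

```python
import re

line = 'From: stephen.marquard@uct.ac.za'

match = re.search(r'From:', line)        # a match object, which is truthy
no_match = re.search(r'Subject:', line)  # None, which is falsy

if match:
    # group() is the matched text; start() and end() are its position
    print(match.group(), match.start(), match.end())
print(no_match)
```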

In [3]:
import re

# returns a list of matching numbers
x = 'My 2 favourite numbers are 6 and 17'
y = re.findall('[0-9]+',x)
print(y)

['2', '6', '17']


In [4]:
# there are no upper case vowels in the string x, returns empty list
y = re.findall('[AEIOU]+', x)
print(y)

[]


### Warning: Greedy Matching
- The repeat characters (`*` and `+`) push outward in both directions (greedy) to match the largest possible string

In [5]:
import re
x = 'From: Using the : character'
y = re.findall('^F.+:', x)
print(y)

['From: Using the :']


### Non-Greedy Matching
- Not all regular expression repeat codes are greedy!
- If you add a `?` character, the `+` and `*` chill out a bit.

In [6]:
import re
x = 'From: Using the : character'
y = re.findall('^F.+?:', x)
print(y)

['From:']


### Fine-tuning String Extraction
- You can refine the match for `re.findall()` and separately determine which portion of the match is to be extracted by using parentheses.
- `\S+` means at least one non-whitespace character

In [7]:
x = 'From stephen.marquard@uct.ac.za Sat Jun  5 09:14:16 2008'
y = re.findall(r'\S+@\S+', x)  # raw string, so \S reaches the regex engine unchanged
print(y)

['stephen.marquard@uct.ac.za']


In [8]:
# Match the 'From email@email' but extract just the email portion
x = 'From stephen.marquard@uct.ac.za Sat Jun  5 09:14:16 2008'
y = re.findall(r'From (\S+@\S+)', x)  # raw string, so \S reaches the regex engine unchanged
print(y)

['stephen.marquard@uct.ac.za']
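
When a pattern contains more than one pair of parentheses, `re.findall()` returns a list of tuples, one element per group. A minimal sketch on the same line (the two-group pattern is an illustration, not part of the course material):

```python
import re

x = 'From stephen.marquard@uct.ac.za Sat Jun  5 09:14:16 2008'

# Two groups: the part before the @ and the part after it
pairs = re.findall(r'From (\S+)@(\S+)', x)
print(pairs)

user, host = pairs[0]
print(user)
print(host)
```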


### The Double Split Pattern (Old)
- Sometimes we split a line one way, then grab one of the pieces of the line and split that piece again.

In [9]:
line = 'From stephen.marquard@uct.ac.za Sat Jun  5 09:14:16 2008'

words = line.split()
print(words)

email = words[1]
print(email)

pieces = email.split('@')
print(pieces[1])

['From', 'stephen.marquard@uct.ac.za', 'Sat', 'Jun', '5', '09:14:16', '2008']
stephen.marquard@uct.ac.za
uct.ac.za


### The Double Split Pattern (Regex Version)
`@([^ ]*)` look thru the string until you find an at sign

`[^ ]` match non-blank character

`*` match many of them

In [10]:
import re
line = 'From stephen.marquard@uct.ac.za Sat Jun  5 09:14:16 2008'
y = re.findall('@([^ ]*)', line)
print(y)

['uct.ac.za']


### The Double Split Pattern (Even Cooler Regex Version)
`^From .*@([^ ]*)` 

`^From ` start at the beginning of the line and look for the string 'From '

`.*@` skip ahead through the string until you find an at sign

`[^ ]` match non-blank character

`*` match many of them

In [11]:
import re
line = 'From stephen.marquard@uct.ac.za Sat Jun  5 09:14:16 2008'
y = re.findall('^From .*@([^ ]*)', line)
print(y)

['uct.ac.za']


### Spam Confidence

In [12]:
import re
hand = open('mbox-short.txt')
numlist = list()
for line in hand:
    line = line.rstrip()
    stuff = re.findall('X-DSPAM-Confidence: ([0-9.]+)', line)
    # stuff is a list of the captured number strings found on this line
    # skip any line that does not contain exactly one match (most lines contain none)
    if len(stuff) != 1 : continue
    num = float(stuff[0])
    numlist.append(num)
print('Maximum:', max(numlist))

Maximum: 0.9907


### Escape Character
- If you want a special regular expression character to just behave normally (most of the time) you prefix it with a '\'


In [13]:
import re
x = 'We just received $10.00 for cookies.'
y = re.findall(r'\$[0-9.]+', x)  # raw string, so \$ reaches the regex engine unchanged
print(y)

['$10.00']
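
If the literal text to search for is only known at run time, Python's `re.escape()` adds the backslashes for you. A small sketch, assuming we want to find the exact text `$10.00`:

```python
import re

x = 'We just received $10.00 for cookies.'

literal = '$10.00'
pattern = re.escape(literal)   # backslash-escapes the special characters
print(pattern)

found = re.findall(pattern, x)
print(found)

# Because the dot is escaped, it no longer acts as a wildcard
not_found = re.findall(pattern, 'We just received $10x00')
print(not_found)
```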


In [14]:
# quiz
import re
line = 'From stephen.marquard@uct.ac.za Sat Jun  5 09:14:16 2008'
y = re.findall(r'@(\S+)', line)  # raw string, so \S reaches the regex engine unchanged
print(y)

['uct.ac.za']


In [15]:
x = 'From: Using the : character'
y = re.findall('^F.+:', x)
print(y)

['From: Using the :']


In [16]:
# quiz
import re
line = 'From stephen.marquard@uct.ac.za Sat Jun  5 09:14:16 2008'
y = re.findall(r'\S+?@\S+', line)  # raw string, so \S reaches the regex engine unchanged
print(y)

['stephen.marquard@uct.ac.za']


# Assignment Extracting Data With Regular Expressions

Finding Numbers in a Haystack

In this assignment you will read through and parse a file with text and numbers. You will extract all the numbers in the file and compute the sum of the numbers.

Data Files
We provide two files for this assignment. One is a sample file where we give you the sum for your testing and the other is the actual data you need to process for the assignment.

Sample data: http://py4e-data.dr-chuck.net/regex_sum_42.txt (There are 90 values with a sum=445833)

Actual data: http://py4e-data.dr-chuck.net/regex_sum_1088547.txt (There are 67 values and the sum ends with 236)

These links open in a new window. Make sure to save the file into the same folder as you will be writing your Python program. Note: Each student will have a distinct data file for the assignment - so only use your own data file for analysis.

Data Format

The file contains much of the text from the introduction of the textbook except that random numbers are inserted throughout the text. Here is a sample of the output you might see:

```
Why should you learn to write programs? 7746
12 1929 8827
Writing programs (or programming) is a very creative 
7 and rewarding activity.  You can write programs for 
many reasons, ranging from making your living to solving
8837 a difficult data analysis problem to having fun to helping 128
someone else solve a problem.  This book assumes that 
everyone needs to know how to program ...
```
The sum for the sample text above is 27486. The numbers can appear anywhere in the line. There can be any number of numbers in each line (including none).

**Handling The Data**

The basic outline of this problem is to read the file, look for integers using the re.findall(), looking for a regular expression of '[0-9]+' and then converting the extracted strings to integers and summing up the integers.
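
The outline can be checked against the sample text quoted above before touching the real data file. This sketch keeps the text in a string instead of reading a file; the variable names are chosen only for this example.

```python
import re

sample = '''Why should you learn to write programs? 7746
12 1929 8827
Writing programs (or programming) is a very creative
7 and rewarding activity.  You can write programs for
many reasons, ranging from making your living to solving
8837 a difficult data analysis problem to having fun to helping 128
someone else solve a problem.  This book assumes that
everyone needs to know how to program ...'''

total = 0
for line in sample.splitlines():
    # findall returns an empty list for lines without numbers,
    # so the inner loop simply does nothing for those lines
    for number in re.findall('[0-9]+', line):
        total += int(number)

print(total)
```

The result agrees with the sum of 27486 stated in the assignment text.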

In [17]:
# import regex
import re
total = 0  # named total rather than sum, to avoid shadowing the built-in sum()

# open the file; report a missing or unreadable file instead of crashing
try:
    fh = open('regex_sum_1088547.txt')
except OSError:
    print("Unable to open file!")
    quit()

# read the file and use regular expressions to extract the numbers
for line in fh:
    line = line.rstrip()
    extracted = re.findall('[0-9]+', line)
    # if the extracted list is empty, no numbers were found on this line, so skip it
    if len(extracted) == 0 : continue
    #print(extracted)

    # convert each extracted string to an integer and add it to the running total
    for number in extracted:
        total = total + int(number)

print("Sum:", total)

Sum: 318236


### Optional: Just for Fun
There are a number of different ways to approach this problem. While we don't recommend trying to write the most compact code possible, it can sometimes be a fun exercise. Here is a redacted two-line version of this program using a list comprehension:

```
import re
print( sum( [ ****** *** * in **********('[0-9]+',**************************.read()) ] ) )
```

# Week 3 - Chapter 12 - Networked Technology
## Hypertext Transfer Protocol (HTTP)

In [18]:
import socket

# init new socket
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# connect to stated URL at port 80
mysock.connect(('data.pr4e.org', 80))

# encode the string to send aka convert Unicode to UTF-8
# because strings in Python are in Unicode, but you need to send in UTF-8 bytes.
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()

# send encoded string to server
mysock.send(cmd)

# keep looping, keep receiving & printing data until we hit end of transmission & break out of loop
while True: 
    data = mysock.recv(512) # use the socket's recv method to receive up to 512 bytes
    # if we get no data, that means end of transmission
    if (len(data) < 1):
        break
    print(data.decode())

# close the socket once done
mysock.close()

HTTP/1.1 200 OK
Date: Fri, 04 Dec 2020 15:54:41 GMT
Server: Apache/2.4.18 (Ubuntu)
Last-Modified: Sat, 13 May 2017 11:22:22 GMT
ETag: "a7-54f6609245537"
Accept-Ranges: bytes
Content-Length: 167
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: text/plain

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already s
ick and pale with grief
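
The stray line break in "Who is already s / ick" above is a side effect of printing each 512-byte chunk separately, since `print()` adds a newline after every chunk. One way around this is to collect all the bytes first and only then split the headers from the body. The sketch below uses two hard-coded byte strings as a stand-in for what `recv()` might return; the split point and the shortened header list are assumptions made for the example.

```python
# Stand-in for the chunks recv() might return (the split point is arbitrary,
# as it is on a real connection)
chunks = [
    b'HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nWho is already s',
    b'ick and pale with grief\n',
]

raw = b''.join(chunks)  # reassemble the full response before decoding

# The headers end at the first blank line, i.e. the first \r\n\r\n
header_bytes, _, body_bytes = raw.partition(b'\r\n\r\n')

headers = header_bytes.decode().split('\r\n')
body = body_bytes.decode()

print(headers[0])
print(body, end='')
```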



# Assignment - Understanding the Request / Response Cycle

Exploring the HyperText Transfer Protocol

You are to retrieve the following document using the HTTP protocol in a way that you can examine the HTTP Response headers.

http://data.pr4e.org/intro-short.txt

There are three ways that you might retrieve this web page and look at the response headers:

Preferred: Modify the socket1.py program to retrieve the above URL and print out the headers and data.

Make sure to change the code to retrieve the above URL - the values are different for each URL.

Open the URL in a web browser with a developer console or FireBug and manually examine the headers that are returned.


In [19]:
import socket

# init new socket
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# connect to stated URL at port 80
mysock.connect(('data.pr4e.org', 80))

# encode the string to send aka convert Unicode to UTF-8
# because strings in Python are in Unicode, but you need to send in UTF-8 bytes.
cmd = 'GET http://data.pr4e.org/intro-short.txt HTTP/1.0\r\n\r\n'.encode()

# send encoded string to server
mysock.send(cmd)

# keep looping, keep receiving & printing data until we hit end of transmission & break out of loop
while True: 
    data = mysock.recv(512) # use the socket's recv method to receive up to 512 bytes
    # if we get no data, that means end of transmission
    if (len(data) < 1):
        break
    print(data.decode())

# close the socket once done
mysock.close()

HTTP/1.1 200 OK
Date: Fri, 04 Dec 2020 15:54:41 GMT
Server: Apache/2.4.18 (Ubuntu)
Last-Modified: Sat, 13 May 2017 11:22:22 GMT
ETag: "1d3-54f6609240717"
Accept-Ranges: bytes
Content-Length: 467
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: text/plain

Why should you learn to write programs?

Writing programs (or programming) is a very creative 
and rewarding activity.  You can write programs
 for 
many reasons, ranging from making your living to solving
a difficult data analysis problem to having fun to helping
someone else solve a problem.  This book assumes that 
everyone needs to know how to program, and that once 
you know how to program you will figure out what you want 
to do with your newfound skills.  



# Week 4 - Chapter 12 - Unicode Characters and Strings

- ASCII (Mapping for numbers to characters)

### Representing Simple Strings

- Each character is represented by a number between 0 and 255 stored in 8 bits of memory.

- We refer to 8 bits of memory as a "byte" (e.g. my disk drive contains 3 terabytes of storage)

- The `ord()` function tells us the numeric value of a simple ASCII character.

In [20]:
# uppercase has lower ordinal than lowercase
print(ord('H'))
print(ord('e'))
print(ord('\n'))

72
101
10
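
The inverse of `ord()` is `chr()`, which turns a number back into its character. A minimal sketch of the round trip:

```python
# Map each character to its numeric value, then back again
codes = [ord(ch) for ch in 'Hello']
restored = ''.join(chr(code) for code in codes)

print(codes)
print(restored)

# Uppercase letters have lower values than their lowercase forms
print(ord('H') < ord('h'))
```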


### Multi-Byte Characters

- To represent the wide range of characters computers must handle, we represent characters with more than one byte

    - UTF-16 - Variable length - Two or four bytes
    
    - UTF-32 - Fixed length - Four bytes
    
    - UTF-8 - Variable length - One to four bytes
    
      - Upwards compatible with ASCII
      
      - Automatic detection between ASCII and UTF-8
      
      - UTF-8 is recommended practice for encoding data to be exchanged between systems
      
- In Python 3, all strings are Unicode.

- When we talk to a network resource using sockets or talk to a database we have to encode and decode data (usually to UTF-8)

### Python Strings to Bytes

- When we talk to an external resource like a network socket,  we send bytes.

- So, we need to encode Python 3 strings into a given character encoding.

- When we read data from an external source, we must decode it based on the character set so it is properly represented in Python 3 as a string.
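
The encode and decode steps can be seen without any network connection. This is a small sketch; the sample word is chosen only because its last character falls outside ASCII.

```python
text = 'caf\u00e9'   # the word 'café'; \u00e9 is the accented e

utf8 = text.encode('utf-8')   # str -> bytes
print(len(text), 'characters,', len(utf8), 'bytes in UTF-8')

# The same four characters in two other encodings (little-endian, no byte-order mark)
utf16 = text.encode('utf-16-le')
utf32 = text.encode('utf-32-le')
print(len(utf16), 'bytes in UTF-16,', len(utf32), 'bytes in UTF-32')

round_trip = utf8.decode('utf-8')   # bytes -> str
print(round_trip == text)
```

The three ASCII letters take one byte each in UTF-8, while the accented letter takes two, which is why the byte count is larger than the character count.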

### Retrieving Web Pages using `urllib` in Python

- We have a library that does all the socket work for us and makes web pages look like a file

In [21]:
import urllib.request, urllib.parse, urllib.error

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in fhand:
    print(line.decode().strip())
    
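# NOTE: fhand was fully consumed by the loop above, so the loop below reads
# nothing and counts stays empty; call urlopen() again (or count inside the
# first loop) to get real word counts
counts = dict()
for line in fhand: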
    words = line.decode().split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1
print(counts)

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
{}
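
The empty dictionary above appears because a response object can only be read once, and the first loop had already read all of it. One possible fix is to print and count in a single pass. The sketch below uses a hard-coded list of byte strings as a stand-in for the lines that `urlopen()` would yield, so it runs without a network connection; the name `response_lines` is made up for this example.

```python
# Stand-in for the lines a urlopen() response would yield (bytes, not str)
response_lines = [
    b'But soft what light through yonder window breaks\n',
    b'It is the east and Juliet is the sun\n',
    b'Arise fair sun and kill the envious moon\n',
    b'Who is already sick and pale with grief\n',
]

counts = dict()
for line in response_lines:
    text = line.decode()
    print(text.strip())
    # count the words in the same pass, while the line is still available
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1

print(counts)
```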


### Reading web pages

In [22]:
import urllib.request, urllib.parse, urllib.error

fhand = urllib.request.urlopen('http://www.dr-chuck.com/page1.htm')
for line in fhand:
    print(line.decode().strip())

<h1>The First Page</h1>
<p>
If you like, you can switch to the
<a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>.
</p>


### Web Scraping

- When a program or script pretends to be a browser and retrieves web pages, looks at those web pages, extracts information and then looks at more web pages

- Search engines scrape web pages - we call this "spidering the web" or "web crawling"

### Another library: Beautiful Soup

- Library to easily parse web pages

In [23]:
pip install beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


In [24]:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup

# url = 'http://www.dr-chuck.com/page2.htm'
url = input('Enter -')
html = urllib.request.urlopen(url).read() # open and read url
soup = BeautifulSoup(html, 'html.parser') # parse with html parser

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))

Enter - http://www.dr-chuck.com/page2.htm


page1.htm


# Assignment - Scraping Numbers from HTML using BeautifulSoup

In this assignment you will write a Python program similar to http://www.py4e.com/code3/urllink2.py. The program will use urllib to read the HTML from the data files below, and parse the data, extracting numbers and compute the sum of the numbers in the file.

We provide two files for this assignment. One is a sample file where we give you the sum for your testing and the other is the actual data you need to process for the assignment.

Sample data: http://py4e-data.dr-chuck.net/comments_42.html (Sum=2553)

Actual data: http://py4e-data.dr-chuck.net/comments_1088549.html (Sum ends with 56)

You do not need to save these files to your folder since your program will read the data directly from the URL. Note: Each student will have a distinct data url for the assignment - so only use your own data url for analysis.

Data Format

The file is a table of names and comment counts. You can ignore most of the data in the file except for lines like the following:

```
<tr><td>Modu</td><td><span class="comments">90</span></td></tr>
<tr><td>Kenzie</td><td><span class="comments">88</span></td></tr>
<tr><td>Hubert</td><td><span class="comments">87</span></td></tr>
```

You are to find all the <span> tags in the file and pull out the numbers from the tag and sum the numbers.
Look at the sample code provided. It shows how to find all of a certain kind of tag, loop through the tags and extract the various aspects of the tags.
    
```
...
# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
   # Look at the parts of a tag
   print('TAG:', tag)
   print('URL:', tag.get('href', None))
   print('Contents:', tag.contents[0])
   print('Attrs:', tag.attrs)
```
You need to adjust this code to look for span tags and pull out the text content of the span tag, convert them to integers and add them up to complete the assignment.
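
For comparison, the same extraction can be sketched with only the standard library, using `html.parser.HTMLParser` instead of BeautifulSoup. This is a swapped-in technique, not the approach the assignment asks for, and the class name `SpanNumbers` is made up for this example. It runs on the three sample rows shown above rather than on a downloaded page.

```python
from html.parser import HTMLParser

class SpanNumbers(HTMLParser):
    """Collects the integer text found inside <span> tags."""

    def __init__(self):
        super().__init__()
        self.in_span = False
        self.numbers = []

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            self.in_span = True

    def handle_endtag(self, tag):
        if tag == 'span':
            self.in_span = False

    def handle_data(self, data):
        # only keep text that sits inside a span and looks like a number
        if self.in_span and data.strip().isdigit():
            self.numbers.append(int(data.strip()))

html = '''<tr><td>Modu</td><td><span class="comments">90</span></td></tr>
<tr><td>Kenzie</td><td><span class="comments">88</span></td></tr>
<tr><td>Hubert</td><td><span class="comments">87</span></td></tr>'''

parser = SpanNumbers()
parser.feed(html)
parser.close()

print('Count', len(parser.numbers))
print('Sum', sum(parser.numbers))
```

BeautifulSoup does the same bookkeeping internally, which is why the assignment solution needs only `soup('span')` and `tag.contents[0]`.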
    
Sample Execution

```
$ python3 solution.py
Enter - http://py4e-data.dr-chuck.net/comments_42.html
Count 50
Sum 2...
```

In [25]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")
#print(soup)

count = 0
total = 0  # named total rather than sum, to avoid shadowing the built-in sum()

# Retrieve all of the span tags
tags = soup('span')
for tag in tags:
    # Look at the parts of a tag
    #print('TAG:', tag)
    #print('Contents:', tag.contents[0])
    
    # convert to integer
    num = int(tag.contents[0])
    count += 1
    total += num
    
print("Count", count)
print("Sum", total)

Enter -  http://py4e-data.dr-chuck.net/comments_42.html


Count 50
Sum 2553


In [26]:
import urllib.request, urllib.parse, urllib.error

fhand = urllib.request.urlopen('http://py4e-data.dr-chuck.net/comments_42.html')
for line in fhand:
    print(line.decode().strip())


<html>
<head>
<title>Welcome to the comments assignment from www.py4e.com</title>
</head>
<body>
<h1>This file contains the sample data for testing</h1>

<table border="2">
<tr>
<td>Name</td><td>Comments</td>
</tr>
<tr><td>Romina</td><td><span class="comments">97</span></td></tr>
<tr><td>Laurie</td><td><span class="comments">97</span></td></tr>
<tr><td>Bayli</td><td><span class="comments">90</span></td></tr>
<tr><td>Siyona</td><td><span class="comments">90</span></td></tr>
<tr><td>Taisha</td><td><span class="comments">88</span></td></tr>
<tr><td>Alanda</td><td><span class="comments">87</span></td></tr>
<tr><td>Ameelia</td><td><span class="comments">87</span></td></tr>
<tr><td>Prasheeta</td><td><span class="comments">80</span></td></tr>
<tr><td>Asif</td><td><span class="comments">79</span></td></tr>
<tr><td>Risa</td><td><span class="comments">79</span></td></tr>
<tr><td>Zi</td><td><span class="comments">78</span></td></tr>
<tr><td>Danyil</td><td><span class="comments">76</span></td></tr

# Assignment - Following Links in HTML Using BeautifulSoup

In [32]:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

#url = 'http://py4e-data.dr-chuck.net/known_by_Fikret.html'
url = input('Enter URL: ')
count = input('Enter count: ')

try:
    icount = int(count)
except:
    print("Not a valid number!")
    quit()

pos = input('Enter position: ')

try:
    ipos = int(pos)
except:
    print("Not a valid number!")
    quit()

def follow(url, count, pos):
    # always print the first given url
    print('Retrieving:', url)

    # for-loop iterates count times
    for i in range(count):
        position = 0 # zero-based index of the current tag; the target is at pos - 1
        html = urllib.request.urlopen(url, context=ctx).read()
        soup = BeautifulSoup(html, 'html.parser')

        # Retrieve all of the a tags
        tags = soup('a')
        for tag in tags:
            # once we reach the specified position, follow that link
            if position == pos - 1:
                url = tag.get('href', None) # the next page to retrieve
                print('Retrieving:', url)
                break # no need to look at the remaining tags
            # otherwise move on to the next tag
            position += 1

# call the method
follow(url, icount, ipos)

Enter URL:  http://py4e-data.dr-chuck.net/known_by_Fikret.html
Enter count:  4
Enter position:  3


Retrieving: http://py4e-data.dr-chuck.net/known_by_Fikret.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Montgomery.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Mhairade.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Butchi.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Anayah.html


# Week 5 - Chapter 13 - Parsing XML


In [33]:
import xml.etree.ElementTree as ET

data = '''<person>
<name>Chuck</name>
<phone type="intl">+1 734 303 4456</phone>
<email hide="yes"/> 
</person>'''

tree = ET.fromstring(data) 
print('Name:',tree.find('name').text) 
print('Attr:',tree.find('email').get('hide')) 


Name: Chuck
Attr: yes


In [34]:
import xml.etree.ElementTree as ET

# named xml_data rather than input, so the built-in input() used in later cells still works
xml_data = '''<stuff>
<users>
    <user x="2">
        <id>001</id>
        <name>Chuck</name>
    </user>
    <user x="7">
        <id>009</id>
        <name>Brent</name>
    </user>
</users>
</stuff>'''

stuff = ET.fromstring(xml_data) 
lst = stuff.findall('users/user') # find all user tag under users
print('User count:', len(lst))
for item in lst:
    print('Name', item.find('name').text)
    print('Id', item.find('id').text)
    print('Attribute', item.get('x'))

User count: 2
Name Chuck
Id 001
Attribute 2
Name Brent
Id 009
Attribute 7


# Assignment - Extracting Data from XML

In this assignment you will write a Python program somewhat similar to http://www.py4e.com/code3/geoxml.py. The program will prompt for a URL, read the XML data from that URL using urllib and then parse and extract the comment counts from the XML data, compute the sum of the numbers in the file.

We provide two files for this assignment. One is a sample file where we give you the sum for your testing and the other is the actual data you need to process for the assignment.

Sample data: http://py4e-data.dr-chuck.net/comments_42.xml (Sum=2553)

Actual data: http://py4e-data.dr-chuck.net/comments_1088551.xml (Sum ends with 30)

You do not need to save these files to your folder since your program will read the data directly from the URL. Note: Each student will have a distinct data url for the assignment - so only use your own data url for analysis.

Data Format and Approach

The data consists of a number of names and comment counts in XML as follows:

```
<comment>
  <name>Matthias</name>
  <count>97</count>
</comment>
```

You are to look through all the <comment> tags, find the <count> values and sum the numbers. The closest sample code that shows how to parse XML is geoxml.py. But since the nesting of the elements in our data is different than the data we are parsing in that sample code you will have to make real changes to the code.
To make the code a little simpler, you can use an XPath selector string to look through the entire tree of XML for any tag named 'count' with the following line of code:

```
counts = tree.findall('.//count')
```
    
Take a look at the Python ElementTree documentation and look for the supported XPath syntax for details. You could also work from the top of the XML down to the comments node and then loop through the child nodes of the comments node.
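
Both approaches can be tried on a small in-memory document first. In this sketch the XML is hand-written for illustration: the root element name `commentinfo` and the second entry are assumptions, and only the `comments/comment/count` nesting matters.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; the root element name is an assumption
data = '''<commentinfo>
  <comments>
    <comment><name>Matthias</name><count>97</count></comment>
    <comment><name>Geomer</name><count>97</count></comment>
  </comments>
</commentinfo>'''

tree = ET.fromstring(data)

# XPath selector: every <count> element at any depth
counts = tree.findall('.//count')
total = sum(int(node.text) for node in counts)

# Top-down alternative: walk from the root to each <comment>
comments = tree.findall('comments/comment')

print('Count:', len(counts))
print('Sum:', total)
print('Comments found top-down:', len(comments))
```

With `.//count` each result is the `<count>` element itself, so its value is read with `.text` directly rather than with a further `find('count')`.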
Sample Execution

```
$ python3 solution.py
Enter location: http://py4e-data.dr-chuck.net/comments_42.xml
Retrieving http://py4e-data.dr-chuck.net/comments_42.xml
Retrieved 4189 characters
Count: 50
Sum: 2...
```

In [1]:
import urllib.request, urllib.parse, urllib.error
import xml.etree.ElementTree as ET
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

#url = 'http://py4e-data.dr-chuck.net/comments_42.xml'
url = input('Enter location:')
print('Retrieving', url)
response = urllib.request.urlopen(url).read()
tree = ET.fromstring(response)

lst = tree.findall('comments/comment') # find all comment under comments
#lst = tree.findall('.//count') # alternative: an XPath selector string finds every 'count' tag in the tree (then read item.text directly)
print('Count:', len(lst))

total = 0  # named total rather than sum, to avoid shadowing the built-in sum()

for item in lst:
    #print('Comment name', item.find('name').text)
    #print('Comment count', item.find('count').text)
    icount = int(item.find('count').text)
    total += icount
print('Sum', total)

Enter location: http://py4e-data.dr-chuck.net/comments_42.xml


Retrieving http://py4e-data.dr-chuck.net/comments_42.xml
Count: 50
Sum 2553


# Week 6 - Chapter 12.5 - JavaScript Object Notation (JSON)

- A JSON object maps naturally onto a Python dictionary (and a JSON array onto a list)

In [2]:
import json
data = '''
{
    "name": "Chuck",
    "phone": {
        "type": "intl",
        "number": "+1 734 303 4456"
    },
    "email":{
        "hide": "yes"
    }
}'''

info = json.loads(data)
print('Name:', info["name"])
print('Hide:', info["email"]["hide"])

Name: Chuck
Hide: yes


### Using Application Programming Interfaces (APIs)

- The original example does not work as-is because it requires a Google API key (which may have to be purchased)
- An alternative URL, found in the discussion forums, works without one:
http://py4e-data.dr-chuck.net/json?address=Ann+Arbor%2C+MI&key=42

- Input: Ann Arbor, MI
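
The query string in that URL is what `urllib.parse.urlencode()` produces from the input address. A minimal sketch that builds the URL without contacting the server:

```python
import urllib.parse

service_url = 'http://py4e-data.dr-chuck.net/json?'
params = {'address': 'Ann Arbor, MI', 'key': 42}

# Spaces become '+' and the comma becomes '%2C'
query = urllib.parse.urlencode(params)
print(query)
print(service_url + query)

# parse_qs reverses the encoding (every value comes back as a list of strings)
decoded = urllib.parse.parse_qs(query)
print(decoded)
```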


In [7]:
import urllib.request, urllib.parse, urllib.error
import json

serviceUrl = 'http://py4e-data.dr-chuck.net/json?'

# The loop is disabled because the kernel would wait for input on "Restart & Run All"
#while True:
address = input('Enter location: ')
#if len(address) < 1: break

url = serviceUrl + urllib.parse.urlencode({'address': address, 'key': 42})

print('Retrieving', url)
uh = urllib.request.urlopen(url)
data = uh.read().decode()
print('Retrieved', len(data), 'characters')

try:
    js = json.loads(data)
except json.JSONDecodeError:
    js = None

if not js or 'status' not in js or js['status'] != 'OK':
    print('=== Failure To Retrieve ===')
    print(data)
    #continue
else:
    # only read the result fields once we know the lookup succeeded
    lat = js["results"][0]["geometry"]["location"]["lat"]
    lng = js["results"][0]["geometry"]["location"]["lng"]
    print('lat', lat, 'lng', lng)
    location = js['results']
    print(location)

Enter location:  Ann Arbor, MI


Retrieving http://py4e-data.dr-chuck.net/json?address=Ann+Arbor%2C+MI&key=42
Retrieved 1736 characters
lat 42.2808256 lng -83.7430378
[{'address_components': [{'long_name': 'Ann Arbor', 'short_name': 'Ann Arbor', 'types': ['locality', 'political']}, {'long_name': 'Washtenaw County', 'short_name': 'Washtenaw County', 'types': ['administrative_area_level_2', 'political']}, {'long_name': 'Michigan', 'short_name': 'MI', 'types': ['administrative_area_level_1', 'political']}, {'long_name': 'United States', 'short_name': 'US', 'types': ['country', 'political']}], 'formatted_address': 'Ann Arbor, MI, USA', 'geometry': {'bounds': {'northeast': {'lat': 42.3239728, 'lng': -83.6758069}, 'southwest': {'lat': 42.222668, 'lng': -83.799572}}, 'location': {'lat': 42.2808256, 'lng': -83.7430378}, 'location_type': 'APPROXIMATE', 'viewport': {'northeast': {'lat': 42.3239728, 'lng': -83.6758069}, 'southwest': {'lat': 42.222668, 'lng': -83.799572}}}, 'place_id': 'ChIJMx9D1A2wPIgR4rXIhkb5Cds', 'types': ['lo

### Securing API Requests
- Authorization and Authentication

In [4]:
import urllib.request, urllib.parse, urllib.error
import twurl
import json

TWITTER_URL = 'https://api.twitter.com/1.1/friends/list.json'

while True:
    print('')
    acct = input('Enter Twitter Account: ')
    if (len(acct) < 1): break
    url = twurl.augment(TWITTER_URL,
                        {'screen_name': acct, 
                        'count': 5})
    print('Retrieving', url)
    connection = urllib.request.urlopen(url)
    data = connection.read().decode()
    headers = dict(connection.getheaders())
    print('Remaining', headers['x-rate-limit-remaining'])
    js = json.loads(data)
    print(json.dumps(js, indent=4))
    
    for u in js['users']:
        print(u['screen_name'])
        s = u['status']['text']
        print('  ', s[:50])

ModuleNotFoundError: No module named 'twurl'

In [2]:
pip install twitter

Note: you may need to restart the kernel to use updated packages.


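Note: the `twurl` module is a helper file (`twurl.py`) from the py4e sample code rather than part of the `twitter` package, so installing `twitter` does not fix the import error. This example is skipped.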

# Assignment - Extracting Data from JSON

In this assignment you will write a Python program somewhat similar to http://www.py4e.com/code3/json2.py. The program will prompt for a URL, read the JSON data from that URL using urllib and then parse and extract the comment counts from the JSON data, compute the sum of the numbers in the file and enter the sum below:
We provide two files for this assignment. One is a sample file where we give you the sum for your testing and the other is the actual data you need to process for the assignment.

Sample data: http://py4e-data.dr-chuck.net/comments_42.json (Sum=2553)

Actual data: http://py4e-data.dr-chuck.net/comments_1088552.json (Sum ends with 88)

You do not need to save these files to your folder since your program will read the data directly from the URL. Note: Each student will have a distinct data url for the assignment - so only use your own data url for analysis.

Data Format
The data consists of a number of names and comment counts in JSON as follows:

```
{
  "comments": [
    {
      "name": "Matthias",
      "count": 97
    },
    {
      "name": "Geomer",
      "count": 97
    }
    ...
  ]
}
```

The closest sample code that shows how to parse JSON and extract a list is json2.py. You might also want to look at geoxml.py to see how to prompt for a URL and retrieve data from a URL.
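
The parsing step can be tried without any network access. The following is a minimal sketch that uses a small inline string in place of the downloaded data; the two entries come from the Data Format example above, and the variable names are illustrative rather than taken from json2.py.

```python
import json

# Inline stand-in for the text downloaded from the URL (two entries only)
data = '''
{
  "comments": [
    {"name": "Matthias", "count": 97},
    {"name": "Geomer", "count": 97}
  ]
}
'''

info = json.loads(data)

# Collect every count as an int, then add them up
counts = [int(item['count']) for item in info['comments']]
total = sum(counts)

print('Count:', len(counts))
print('Sum:', total)
```

Building a list of counts first keeps the built-in `sum()` available and makes the count and the sum come from the same data.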

Sample Execution

```
$ python3 solution.py
Enter location: http://py4e-data.dr-chuck.net/comments_42.json
Retrieving http://py4e-data.dr-chuck.net/comments_42.json
Retrieved 2733 characters
Count: 50
Sum: 2...
```

In [18]:
import urllib.request
import json
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

#url = 'http://py4e-data.dr-chuck.net/comments_42.json'
url = input('Enter location: ')
print('Retrieving', url)

# Read the raw bytes and decode them to text before parsing
response = urllib.request.urlopen(url, context=ctx).read()
data = response.decode()
print('Retrieved', len(data), 'characters')

info = json.loads(data)

count = 0
total = 0  # named total so the built-in sum() is not shadowed

for item in info['comments']:
    #print('Name:', item['name'])
    #print('Count:', item['count'])
    count += 1
    total += int(item['count'])

print('Count:', count)
print('Sum:', total)

Enter location:  http://py4e-data.dr-chuck.net/comments_1088552.json


Retrieving http://py4e-data.dr-chuck.net/comments_1088552.json
Retrieved 2740 characters
Count: 50
Sum: 2588


# Assignment - Using the GeoJSON API
In this assignment you will write a Python program somewhat similar to http://www.py4e.com/code3/geojson.py. The program will prompt for a location, contact a web service, retrieve JSON from that web service, parse the data, and extract the first place_id from the JSON. A place ID is a textual identifier that uniquely identifies a place within Google Maps.

API End Points

To complete this assignment, you should use this API endpoint that has a static subset of the Google Data:

```
http://py4e-data.dr-chuck.net/json?
```

This API uses the same parameter (address) as the Google API. It also has no rate limit, so you can test as often as you like. If you visit the URL with no parameters, you get a "No address..." response.

To call the API, you need to include a key= parameter and provide the address you are requesting as the address= parameter, properly URL encoded using the urllib.parse.urlencode() function as shown in http://www.py4e.com/code3/geojson.py

Make sure to check that your code is using the API endpoint shown above. You will get different results from the geojson and json endpoints, so make sure you are using the same endpoint as the autograder.
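
As a rough illustration of the encoding step, the sketch below builds the request URL for the test location without contacting the server. The key value 42 is the one used in the solution code in these notes; treat it as an assumption about what the endpoint accepts.

```python
import urllib.parse

serviceurl = 'http://py4e-data.dr-chuck.net/json?'

# Parameters are URL encoded: spaces become '+', pairs are joined with '&'
parms = {'address': 'South Federal University', 'key': 42}
url = serviceurl + urllib.parse.urlencode(parms)

print(url)
```

Printing the URL before calling urlopen() is a quick way to confirm that the endpoint and both parameters are what the autograder expects.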

Test Data / Sample Execution

You can test to see if your program is working with a location of "South Federal University" which will have a place_id of "ChIJJ2MNmPl_bIcRt8t5x-X5ZhQ".

```
$ python3 solution.py
Enter location: South Federal University
Retrieving http://...
Retrieved 2290 characters
Place id ChIJJ2MNmPl_bIcRt8t5x-X5ZhQ
```

Turn In

Please run your program to find the place_id for this location:
```
University of Texas at Austin
```

Make sure to enter the name and case exactly as above and enter the place_id and your Python code below. Hint: The first seven characters of the place_id are "ChIJt8- ..."

Make sure to retrieve the data from the URL specified above and not the normal Google API. Your program should work with the Google API, but the place_id may not match for this assignment.
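
The extraction step can also be checked offline. This is a minimal sketch that parses a hand-written, heavily trimmed response; the structure (a `results` list plus a `status` field) and the Ann Arbor place_id are taken from the geocoding output shown earlier in these notes, and everything else is omitted for brevity.

```python
import json

# Hand-written, trimmed stand-in for a real response (assumption: only
# the fields used below are included)
data = '''
{
  "results": [
    {"formatted_address": "Ann Arbor, MI, USA",
     "place_id": "ChIJMx9D1A2wPIgR4rXIhkb5Cds"}
  ],
  "status": "OK"
}
'''

js = json.loads(data)

# Only read the first result when the service reports success
if js.get('status') == 'OK' and js['results']:
    place_id = js['results'][0]['place_id']
else:
    place_id = None

print('Place id', place_id)
```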

In [None]:
import urllib.request, urllib.parse, urllib.error
import json
import ssl

api_key = False
# If you have a Google Places API key, enter it here
# api_key = 'AIzaSy___IDByT70'
# https://developers.google.com/maps/documentation/geocoding/intro

if api_key is False:
    api_key = 42
    serviceurl = 'http://py4e-data.dr-chuck.net/json?'
else:
    serviceurl = 'https://maps.googleapis.com/maps/api/geocode/json?'

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

while True:
    address = input('Enter location: ')
    if len(address) < 1: break

    parms = dict()
    parms['address'] = address
    if api_key is not False: parms['key'] = api_key
    url = serviceurl + urllib.parse.urlencode(parms)

    print('Retrieving', url)
    uh = urllib.request.urlopen(url, context=ctx)
    data = uh.read().decode()
    print('Retrieved', len(data), 'characters')

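    # Treat a response that is not valid JSON as a failure
    try:
        js = json.loads(data)
    except json.JSONDecodeError:
        js = None

    if not js or 'status' not in js or js['status'] != 'OK':
        print('==== Failure To Retrieve ====')
        print(data)
        continue

    # The first result holds the place_id we need
    placeid = js['results'][0]['place_id']
    print('Place id', placeid)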

Enter location:  University of Texas at Austin


Retrieving http://py4e-data.dr-chuck.net/json?address=University+of+Texas+at+Austin&key=42
Retrieved 1782 characters
Place id ChIJt8-EJZu1RIYR3iFKF0_uMYE
