This is a Jupyter notebook for learning the Coursera "Programming for Everybody (Getting Started with Python)" UMich course. This is Course 3 in the Programming for Everybody Specialization.

Course website: https://www.coursera.org/learn/python

Start Date: 8/23/2018
End Date: 9/18/2018

The professor of this course, Chuck Severance, has a website for his book: "Python for Everybody".

Link available here: https://www.py4e.com/

You can download all of the sample Python code from the book as well as licensed course materials from http://www.py4e.com/materials.

This course is in Python 3 and the textbook for this class is in Python 3. Prior to July 2017, this course was taught in Python 2 with the textbook "Python for Informatics: Exploring Information". This earlier Python 2 book has translations into Spanish, Korean, and Chinese. The Python 2 book and its translations are still available at the www.pythonlearn.com web site.

All of the book materials are available under a Creative Commons Attribution-NonCommercial 3.0 Unported License. The slides, audio, assignments, auto grader and all course materials other than the book are available from http://www.py4e.com/materials under the more flexible Creative Commons Attribution 3.0 Unported License. If you are curious as to why the "NC" variant of Creative Commons was used, see Appendix D of the textbook or search through my blog posts for the string "copyright".

## Chapter 11: Regular Expressions

Regular expressions are not built into the language the way strings are, so you have to import the library with import re.

In [2]:
import re

You can use re.search() to see if a string matches a regular expression, similar to using the find() method for strings.  re.search() returns a match object if the regular expression matches anywhere in the string, and None otherwise, so the result can be used directly in an if statement.
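A quick way to check what re.search() hands back is to look at the result directly. This is a minimal sketch; the sample line is made up for illustration.

```python
import re

line = 'From: Russia with Love'   # sample line, made up for illustration

hit = re.search('^From:', line)   # a match object, which counts as True in an if
miss = re.search('^To:', line)    # None, which counts as False in an if

print(hit is not None)   # True
print(miss)              # None
print(hit.group())       # From:
```

The match object is also what lets you pull out the matched text later with group().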


In [4]:
# Prints every line of the text file that contains "is".
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('is', line):
        print(line)

This is a message box.
Oh what fun it is


You can also use re.search() like the startswith() string function.

In [11]:
# Prints every line of the text file where "From:" is at the BEGINNING of the line.
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^From:', line):
        print(line)

From: Russia with Love


Some more on regular expressions:

-The dot character matches any character.  

-If you add the asterisk character, the preceding character is matched "0 or more times".

-Therefore a search for the below regular expression...

^X.*:

...is a search for any set of characters where X is at the beginning of the line, followed by any number of characters, followed by a colon. 

-Using "+" instead of "*" in a regular expression matches the preceding character "1 or more times".

-"\S" in a regular expression matches any non-whitespace character.

-The below regular expression...

^X-\S+:

...is a search for lines that start with "X-", followed by one or more non-whitespace characters, followed by a colon.
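To see the difference between the two patterns above, here is a small sketch run against a few header-style lines. The sample lines are made up for illustration and are not taken from mbox-short.txt.

```python
import re

# Sample lines, made up for illustration.
lines = [
    'X-Sieve: CMU Sieve 2.3',
    'X-DSPAM-Result: Innocent',
    'X-Plane is behind schedule: two weeks',
    'From: someone@example.com',
]

# ^X.*:    starts with X, then any characters (spaces included), then a colon
loose = [s for s in lines if re.search('^X.*:', s)]

# ^X-\S+:  starts with X-, then only non-whitespace characters, then a colon
strict = [s for s in lines if re.search(r'^X-\S+:', s)]

print(len(loose))    # 3
print(len(strict))   # 2
```

The third line matches the looser pattern but not the stricter one, because there is a space between "X-Plane" and the colon.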

You can use re.findall() to extract portions of a string that match your regular expression, similar to a combination of find() and slicing: var[5:10].  re.findall() returns a list of all the matching substrings.

In [3]:
# Find every run of one or more digits in a string.
x = 'My 2 favorite numbers are 19 and 42'
y = re.findall('[0-9]+',x) 
print(y)

['2', '19', '42']


In [None]:
# Find every run of one or more uppercase vowels. This string has none, so findall() returns an empty list.
x = 'My 2 favorite numbers are 19 and 42'
y = re.findall('[AEOIU]+',x) 
print(y)

Note: Both the * and the + push outwards as far as they can.  

Example: A search for ^F.+: within "From: Russia with : love" returns "From: Russia with :", not "From:"

This is called "greedy matching".  If you add a "?" after the "*" or "+" (for example, ^F.+?:), the match becomes non-greedy and prefers the shortest string, which here is "From:".
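The greedy and non-greedy versions can be compared side by side on the example string from the note above. This is a minimal sketch.

```python
import re

text = 'From: Russia with : love'

greedy = re.findall('^F.+:', text)    # pushes out to the LAST colon
lazy = re.findall('^F.+?:', text)     # stops at the FIRST colon

print(greedy)   # ['From: Russia with :']
print(lazy)     # ['From:']
```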



## Chapter 12: Networked Technology

HTTP steps:
1) Connect to server
2) Request a document (e.g., GET http://www.dr-chuck.com/page1.htm HTTP/1.0)

In [10]:
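# NOTE: The 400 Bad Request recorded below came from a request that ended in '\n\n'.
# The most likely cause is the line endings: HTTP expects '\r\n', so the request here ends in '\r\n\r\n'.
import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()  # encode() converts the str to UTF-8 bytes.
mysock.send(cmd)  # In HTTP the client speaks first: send the request, then read the response.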

while True:
    data = mysock.recv(512)
    if (len(data) < 1):
        break
    print(data.decode())
mysock.close()

HTTP/1.1 400 Bad Request
Date: Tue, 11 Sep 2018 06:03:17 GMT
Server: Apache/2.4.18 (Ubuntu)
Content-Length: 308
Connection: close
Content-Type: text/html; charset=iso-8859-1

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>400 Bad Request</title>
</head><body>
<h1>Bad Request</h1>
<p>Your browser sent a request that this server could not understand.<br />
</p>
<hr>
<address>Apache/2.4.18 (Ubuntu) Server at do1.dr-chuck.com Port 80</address>
</body></html>



### Unicode characters and strings

ASCII: American Standard Code for Information Interchange

Letters are represented by numbers

The number for "H" is 72.
The number for newline is 10.

...and so on.

Each character is represented by a number between 0 and 255 stored in 8 bits of memory.  

The ord() function returns the numeric value of a simple ASCII character.

In [7]:
print(ord('H'))
print(ord('h'))
print(ord('\n'))

72
104
10


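Unicode: A universal code with room for more than a million characters, so that systems using different character sets can exchange text with each other.

UTF-8 is the recommended encoding for data exchanged between systems.  It uses 1-4 bytes per character.

In Python 3, every string (str) is Unicode internally, not UTF-8, UTF-16, etc.  Raw data, such as what arrives over a socket, uses a separate bytes type.  Note: Python 2 had two kinds of string, a byte string (str) and a Unicode string (unicode); Python 3 keeps a single Unicode str plus bytes.

decode() takes a bytes object and converts it to a str, assuming UTF-8 by default.  encode() goes the other way, from str to bytes.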
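A short sketch of the str/bytes round trip and of the 1-4 byte range in UTF-8. The sample characters are chosen only for illustration.

```python
text = 'café'             # four characters, one of them non-ASCII
raw = text.encode()       # str -> bytes, UTF-8 by default

print(type(raw).__name__)     # bytes
print(len(text), len(raw))    # 4 5  (the accented letter takes two bytes)
print(raw.decode() == text)   # True: bytes -> str gets the original back

# Bytes needed in UTF-8 for a Latin letter, an accented letter,
# the euro sign, and an emoji.
sizes = [len(ch.encode('utf-8')) for ch in ['A', 'é', '€', '😀']]
print(sizes)                  # [1, 2, 3, 4]
```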



### Retrieving web pages

Don't repeat yourself.  There is a library that does the socket work for you and reads web pages: urllib.

In [11]:
# This works!  Four lines of code, and we're reading a web page.
import urllib.request, urllib.parse, urllib.error

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in fhand:
    print(line.decode().strip())  # Each line is a bytes object, not a string, so it needs decode().

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief


Now you can handle data on the internet and run higher-level operations on it, like counting words.

In [12]:
# Same idea, but count how often each word appears on the page.
import urllib.request, urllib.parse, urllib.error

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')

counts = dict()
for line in fhand:
    words = line.decode().split()  # Each line is a bytes object, not a string, so it needs decode().
    for word in words:
        counts[word] = counts.get(word, 0) + 1
print(counts)

{'But': 1, 'soft': 1, 'what': 1, 'light': 1, 'through': 1, 'yonder': 1, 'window': 1, 'breaks': 1, 'It': 1, 'is': 3, 'the': 3, 'east': 1, 'and': 3, 'Juliet': 1, 'sun': 2, 'Arise': 1, 'fair': 1, 'kill': 1, 'envious': 1, 'moon': 1, 'Who': 1, 'already': 1, 'sick': 1, 'pale': 1, 'with': 1, 'grief': 1}


We don't have to just read text files -- we can read HTML files.

In [None]:
import urllib.request, urllib.parse, urllib.error

fhand = urllib.request.urlopen('http://www.dr-chuck.com/page1.htm')
for line in fhand:
    print(line.decode().strip())

### HTML and programs that surf the web

There's a library to scrape the web.  Sometimes, you can get in trouble for spidering.  (Not Wikipedia.)  

The biggest problem with spidering is the parsing of HTML.  Real-world HTML is often messy or malformed, which makes it hard to parse by hand.

"Beautiful Soup": An HTML parser

Beautiful Soup is not part of the standard library, so it has to be installed first.  You can install it from
https://pypi.python.org/pypi/beautifulsoup4

Or download the file
http://www.py4e.com/code3/bs4.zip

Then import the packages like so:

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup


In [None]:
# Importing BeautifulSoup (it must already be installed)
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup

In [None]:
#Using BeautifulSoup to read web pages (retrieving the anchor tags)
#Use http://www.dr-chuck.com/page1.htm

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup

url = input('Enter -')
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

#Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))
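
To see roughly what the anchor-tag loop is doing without installing anything, here is a sketch that uses the standard library's html.parser instead of Beautiful Soup. This is a swapped-in technique, not how Beautiful Soup works internally. The class name AnchorCollector and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect the href attribute of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == 'a':
            self.hrefs.append(dict(attrs).get('href'))


# Sample page, made up for illustration.
sample = ('<h1>The First Page</h1>'
          '<p>Go to the <a href="http://example.com/page2.htm">Second Page</a>. '
          '<a name="top">No link here</a></p>')

collector = AnchorCollector()
collector.feed(sample)
print(collector.hrefs)   # ['http://example.com/page2.htm', None]
```

As with tag.get('href', None) in the Beautiful Soup version, an anchor without an href comes back as None.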
    

## Chapter 13: Web Services and XML

So far, we've been using the request-response cycle to move data back and forth.  Web services are a layer on top of that.

With the HTTP Request / Response well understood and well supported, there was a natural move towards exchanging data between programs using these protocols.

We needed to come up with an agreed way to represent data going between applications and across networks.

Two types of commonly used data exchange formats: XML and JSON.

The network is not Java or Python; the network is data.  (It is data that moves across the network.)

"Wire protocol": How the data is put on the wire (how data exits one system, goes over wire, and is received on other end).

"Serialization": The act of going from an internal representation on one computer out to an interchange format.  "Serialization" because in the old days, data was transmitted over the wire serially (one character at a time).

"De-serialization": The act of taking an interchange format, and converting it to a new internal representation.

XML and JSON are the two serialization formats.  JSON is more modern than XML.
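
A minimal sketch of serialization and de-serialization, using JSON as the interchange format. The record below is made up for illustration, and the "wire" here is just a variable rather than a real network connection.

```python
import json

# Internal representation: an ordinary Python dictionary (made-up data).
record = {'name': 'Chuck', 'scores': [1, 2, 3]}

wire_text = json.dumps(record)    # serialize: dict -> JSON text
payload = wire_text.encode()      # the bytes that would travel over the network

# On the receiving side: bytes -> text -> a new internal representation.
received = json.loads(payload.decode())

print(type(wire_text).__name__)   # str
print(received == record)         # True
```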



### XML: eXtensible Markup Language

XML: Primary purpose is to help information systems share structured data. 


Basics: 

Review: Start tags, End tags, Text content, Attributes, and Self Closing tags

XML Schema: More to follow

Parsing XML in Python: More to follow 

Worked example: XML.  Review later.
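
Until the worked example is written up, here is a small sketch of parsing XML with the standard library's xml.etree.ElementTree. The document below is made up for illustration. It shows start and end tags, text content, an attribute, and a self-closing tag.

```python
import xml.etree.ElementTree as ET

# Made-up XML document for illustration.
data = '''<person>
  <name>Chuck</name>
  <phone type="intl">+1 734 303 4456</phone>
  <email hide="yes"/>
</person>'''

tree = ET.fromstring(data)   # parse the text into a tree of elements

print('Name:', tree.find('name').text)                 # text content
print('Phone type:', tree.find('phone').get('type'))   # attribute
print('Hide:', tree.find('email').get('hide'))         # attribute on a self-closing tag
```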

### JSON and the REST Architecture

JSON: JavaScript Object Notation

JSON is a newer serialization format than XML.  Its syntax comes directly from JavaScript.  

JSON represents data as nested "lists" and "dictionaries".

In [None]:
#An example JSON block
import json
data = '''{
    "name" : "Chuck",
    "phone" : {
        "type" : "intl",
        "number" : "+1 734 303 4456"
    },
    "email" : {
        "hide" : "yes"
    }
}'''

info = json.loads(data) #loads() stands for "load from string". Returns a Python dictionary.
print('Name:', info["name"]) #Since info is a dictionary, just search by key.
print('Hide:', info["email"]["hide"]) #Since info is a dictionary, just go down two nodes to see the value of "hide".

An example JSON list.

In [None]:
import json
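# Named "data" rather than "input" so the built-in input() function is not shadowed.
data = '''[
    {"id" : "001",
    "x" : "2",
    "name" : "Chuck"
    },
    {"id" : "009",
    "x" : "7",
    "name" : "Chuck"
    }
]'''

info = json.loads(data) #loads() stands for "load from string". Here it returns a Python list of dictionaries.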
print('User count:', len(info))
for item in info:
    print('Name', item['name'])
    print('Id', item['id'])
    print('Attribute', item['x'])

### Service Oriented Approach

We started by moving data back and forth, and now we're talking about serialization formats, etc.  The result: You end up with an application that provides services to other applications.  

Example: An airline website where you're booking a plane ticket, and they ask you if you want to book a car or hotel.  

Service oriented approach (SOA): You have a website, but you also give out a service, and other systems can work with your service. Business reason: If you're booking hotels from an airline site, the airline will get a kickback from the hotel for bringing people (and business) there.



### Using Application Programming Interfaces (APIs)

Calling Google Maps as an API:

http://maps.googleapis.com/maps/api/geocode/json?address=Ann+Arbor%2C+MI

In [None]:
import urllib.request, urllib.parse, urllib.error
import json

serviceurl = 'http://maps.googleapis.com/maps/api/geocode/json?address=Ann+Arbor%2C+MI'
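
The query string in that URL does not have to be typed by hand. As a sketch, urllib.parse.urlencode() can build it from a plain address; the variable names base and address are made up for illustration, and no request is sent here.

```python
import urllib.parse

base = 'http://maps.googleapis.com/maps/api/geocode/json?'
address = 'Ann Arbor, MI'

# urlencode() turns the space into '+' and the comma into '%2C'.
url = base + urllib.parse.urlencode({'address': address})
print(url)
# http://maps.googleapis.com/maps/api/geocode/json?address=Ann+Arbor%2C+MI
```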

### Using Application Programming Interfaces (APIs)

Twitter also has an API.  Google says you can only make 25 requests from your computer.  Twitter requires authorization before you can make any requests.

http://dev.twitter.com/docs/platform-objects/tweets

The following Python code files were downloaded from Chuck's website: https://www.py4e.com/code3/

twurl.py, hidden.py, oauth.py, and twitter1.py

Example code that uses the Twitter API to list your friends.

In [1]:
import urllib.request, urllib.parse, urllib.error
import twurl
import json

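# NOTE: The URLError recorded below came from a mistyped host ('api.twitter/com').
# The correct host is 'api.twitter.com', as used here.
TWITTER_URL = 'https://api.twitter.com/1.1/friends/list.json'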

while True:
    print('')
    acct = input('Enter Twitter Account:')
    if (len(acct) < 1): break
    url = twurl.augment(TWITTER_URL,
                       {'screen_name': acct, 'count': '5'})
    print('Retrieving', url)
    connection = urllib.request.urlopen(url)
    data = connection.read().decode()
    headers = dict(connection.getheaders())
    print('Remaining', headers['x-rate-limit-remaining'])
    js = json.loads(data)
    print(json.dumps(js, indent = 4))
    
    for u in js['users']:
        print(u['screen_name'])
        s = u['status']['text']
        print('   ', s[:50])

        


Enter Twitter Account:praetorsung
Retrieving https://api.twitter/com/1.1/friends/list.json?oauth_consumer_key=h7Lu...Ng&oauth_timestamp=1537648823&oauth_nonce=25215078&oauth_version=1.0&screen_name=praetorsung&count=5&oauth_token=10185562-eibxCp9n2...P4GEQQOSGI&oauth_signature_method=HMAC-SHA1&oauth_signature=g0IFfTa0vMa%2FYINvku5Aeox7dHM%3D


URLError: <urlopen error [Errno 11001] getaddrinfo failed>