# Network Data

### 1. Transport Control Protocol (TCP)
- Buil on top of IP (Internet Protocol)
- Assumes IP might lose some data - stores and retransmits data if it seems to be lost
- Handles "flow control" using a transmit window
- provides a nice reliable pipe

### 2. TCP onnections / Sockets

"In computer networking, an Internet **socket** or network **socket** is an endpoint of a bidirectional inter-process communication flow across an Internet protocol-based computer network, such as the Internet."

In [6]:
# Built-in support for TCP Sockets
import socket
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect( ('google.com', 80) )   # (Host, Port)

# It's not sending the data yet.
# It's like dialing the phone.

Since TCP gives us a reliable socket, what do we want to do with the socket? 
<br>What problem do we want to solve?

**Application Protocols**
- Mail
- World Wide Web

### 3. Hypertext Transfer Protocol (HTTP)
- The dominant Application Layer Protocol on the Internet
- Invented for the Web - to retrieve HTML, Images, Documents, etc
- Extended to be data in addition to documents - RSS, Web Services, etc.

Set of rules to allow browsers to retrieve web documents from servers over the Internet.

http://www.dr-chuck.com/page1.htm  ( protocol / host / document)

##### - Getting Data From The Server (Request - Response Cycle)
Each time the user clicks on an anchor tag with an href= value to switch to a new page, the browser makes a connection to the web server and issues a "GET" request - to GET the content of the page at the specified URL.

(telnet)
telnet www.dr-chuck.com 80

In [26]:
# in telnet (Command LINE)
!telnet data.pr4e.org 80
!echo -e "GET http://data.pr4e.org/romeo.txt HTTP/1.0\n\n"

Trying 192.241.136.170...
Connected to data.pr4e.org.
Escape character is '^]'.
Connection closed by foreign host.
-e GET http://data.pr4e.org/romeo.txt HTTP/1.0




In [27]:
# An HTTP Request in Python
import socket
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\n\n'.encode()
    # make a request
    # \n\n is like a grammar
    # .encode to convert Unicode to UTF-8
mysock.send(cmd)

while True :
    data = mysock.recv(512)
    if (len(data) < 1) :
        break
        # End of File , or end of transmission
    print(data.decode())
        # UTF-8 to Unicode
mysock.close()

HTTP/1.1 400 Bad Request
Date: Sat, 10 Feb 2018 13:26:26 GMT
Server: Apache/2.4.7 (Ubuntu)
Content-Length: 307
Connection: close
Content-Type: text/html; charset=iso-8859-1

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>400 Bad Request</title>
</head><body>
<h1>Bad Request</h1>
<p>Your browser sent a request that this server could not understand.<br />
</p>
<hr>
<address>Apache/2.4.7 (Ubuntu) Server at do1.dr-chuck.com Port 80</address>
</body></html>



# Python Exercise

### 1. Using urllib in Python
Since HTTP is so common, we have a library that does all the socket work for us and makes web pages look like a file.

In [38]:
import urllib.request, urllib.parse, urllib.error

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
    # filehandle, same as 'open' for a file
for line in fhand :
    print(line.decode().strip())
    # strip : return a copy of the string with leading and trailing char removed
    # 인자가 생략되거나 None이 전달되면, whitespace 문자열이 사라짐
    # 이 경우엔 \n이 stripped.

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief


In [52]:
fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
counts = dict()
for line in fhand :
    words = line.decode().split()
    for word in words :
        counts[word] = counts.get(word, 0) + 1
        # dict.get은 지정된 key의 value를 반환하거나,
        # 그 값이 없을 경우 옆의 인자를 (이 경우엔 0) 반환한다!
print(counts)

{'But': 1, 'soft': 1, 'what': 1, 'light': 1, 'through': 1, 'yonder': 1, 'window': 1, 'breaks': 1, 'It': 1, 'is': 3, 'the': 3, 'east': 1, 'and': 3, 'Juliet': 1, 'sun': 2, 'Arise': 1, 'fair': 1, 'kill': 1, 'envious': 1, 'moon': 1, 'Who': 1, 'already': 1, 'sick': 1, 'pale': 1, 'with': 1, 'grief': 1}


In [56]:
# Reading Web Pages
fhand = urllib.request.urlopen('http://www.dr-chuck.com/page1.htm')
for line in fhand:
    print(line.decode().strip())

<h1>The First Page</h1>
<p>
If you like, you can switch to the
<a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>.
</p>


### 2. Web Scraping
- When a program or script pretends to be a browser and retrieves web pages, looks at those web pages, extracts information, and then looks at more web pages
- Search engines scrape web pages - we call this "spidering the web" or "web crawling"

##### - Why scrape?
- Pull data - particularly social data - who links to who?
- Get your own data back out of some system that has no "export capability"
- Monitor a site for new information
- Spider the web to make a database for a search engine

##### - The Easy Way -> Beautiful Soup
- You could do string searches the hard way
- Or use the free software library called **BeautifulSoup** from www.crummy.com 

In [60]:
from bs4 import BeautifulSoup

url = input('Enter - ')
html = urllib.request.urlopen(url).read()
    # read the whole line
soup = BeautifulSoup(html, 'html.parser')
    # parsing
    
# Retrieve all of the anchor tags
tags = soup('a')
    # give me a list of anchor tags
for tag in tags :
    print(tag.get('href', None))
    # like dictionary

Enter - https://www.google.com
https://www.google.co.kr/imghp?hl=ko&tab=wi
https://maps.google.co.kr/maps?hl=ko&tab=wl
https://play.google.com/?hl=ko&tab=w8
https://www.youtube.com/?gl=KR&tab=w1
https://news.google.co.kr/nwshp?hl=ko&tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.co.kr/intl/ko/options/
http://www.google.co.kr/history/optout?hl=ko
/preferences?hl=ko
https://accounts.google.com/ServiceLogin?hl=ko&passive=true&continue=https://www.google.co.kr/%3Fgfe_rd%3Dcr%26dcr%3D0%26ei%3DjP5-WrvfEo7A9AXMnbPQCA
/search?dcr=0&site=&ie=UTF-8&q=2018+%EB%8F%99%EA%B3%84+%EC%98%AC%EB%A6%BC%ED%94%BD&oi=ddle&ct=2018-doodle-snow-games-day-4-6210713008734208-lawcta&hl=ko&kgmid=/m/03tng8&sa=X&ved=0ahUKEwiwoPO8xJvZAhUBo5QKHaz2C6wQPQgD
/advanced_search?hl=ko&authuser=0
/language_tools?hl=ko&authuser=0
/intl/ko/ads/
http://www.google.co.kr/intl/ko/services/
https://plus.google.com/102197601262446632410
/intl/ko/about.html
https://www.google.co.kr/setpr

# Data On the Web
- With the HTTP request/response well understood and well supported, there was a natural move toward exchanging data btw programs using these protocols.
- We needed to come up with an agreed way to represent data going btw applications and across networks.
- There are two commonly used formats : **XML and JSON**

XML이 좀 더 오래되었고 복잡함.
JSON이 더 최근 것이고, 파이썬의 딕셔너리와 비슷하다.

##### - Sending Data across the "Net"
ex. (Python Dictionary) -> Wire -> (Java HashMap)

We use **Serialization format** (XML & JSON)

## 1. XML (eXtensible Markup Language)
- Primary purpose is to help information systems **share structured data**
- It started as a simplified subset of the SGML (Standard Generalized ~), and is designed to be relatively human-legible

### 1) XML "elements" (or Nodes)

White space doesn't matter. 
<br> We indent only to be **readable.**

- Tags : indicate the beginning and ending of elements
- Attributes : Keyword/value pairs on the opening tag of XML
- Serialize / De-Serialize : Act of converting data in one program into a common format that can be stored and/or transmitted btw systems in a programming language-independent manner

#### *XML as a Tree
![title](./XMLTree.png)

#### *XML as a Path

/a/b : X
<br>/a/c/d : Y
<br>/a/c/e : Z

### 2) XML Schema
- Description of the legal format of an XML document
- Often used to specify a "**contract**" btw systems : **XML Validation**


##### -Many XML Schema Languages
- Document Type Definition (DTD)
- SGML
- XML Schema from W3C (XSD) : **most widely used**


### 3) XSD Schema 
W3C : World Wide Web Consortium


- Constraints
- Data Types
    - string
    - date
    - dateTime
    - decimal
    - integer

### 4) Parsing XML

In [97]:
# Simple Tag! 

import xml.etree.ElementTree as ET
data = '''<person>
<name>Chuck</name>
<phone type="int1">
+1 734 303 4456 </phone>
<email hide="yes"/>
</person>
'''

tree = ET.fromstring(data)
    #fromstring converts 'string' to 'tree'
print('Name:', tree.find('name').text)
    # .text는 tag사이의 내용을 의미
print('Attr:', tree.find('email').get('hide'))
    # 'hide'라는 attribute를 보여주라는 의미

Name: Chuck
Attr: yes


In [103]:
imput = '''<stuff>
<users>
    <user x="2">
        <id>001</id>
        <name>Chuck</name>
    </user>
    <user x="7">
        <id>009</id>
        <name>Brent</name>
    </user>
</users>
</stuff>'''

stuff = ET.fromstring(imput)
lst = stuff.findall('users/user')
print('User count:', len(lst))
for item in lst :
    print('Name', item.find('name').text)
    print('Id', item.find('id').text)
    print('Attribute', item.get("x"))
    

User count: 2
Name Chuck
Id 001
Attribute 2
Name Brent
Id 009
Attribute 7


## 2. JSON (JavaScrip Object Notation)

JSON represents data as nested "lists" and "dictionaries".
<br>So, it's easier to use.

In [111]:
import json
data = '''
{
    "name": "Chuck",
    "phone" : {
        "type" : "intl",
        "number" : "+1 734 303 4456"
    },
    "email" : {
        "hide" : "yes"
    }
}
'''

info = json.loads(data)
print('Name:', info["name"])
print('Hide:', info["email"]["hide"])

Name: Chuck
Hide: yes


In [116]:
info

{'email': {'hide': 'yes'},
 'name': 'Chuck',
 'phone': {'number': '+1 734 303 4456', 'type': 'intl'}}

## 3. Using APIs
- Most non-trivial web applications use services
- They use services from other applications
    - Credit Card Change
    - Hotel Reservation Systems
- Services publish the "rules" applications must follow to make use of the service (**API**)

### 1) API (Application Programming Interface)
We'll use Google Geocoding API

In [128]:
# http://maps.googleapis.com/maps/api/geocode/json?address=Ann+Arbor%2C+MI

import urllib.request, urllib.parse, urllib.error
import json

serviceurl = "http://maps.googleapis.com/maps/api/geocode/json?"

while True :
    address = input('Enter location: ')
    if len(address) < 1 : break
        
    url = serviceurl + urllib.parse.urlencode({'address' : address})
    
    print("Retrieving", url)
    uh = urllib.request.urlopen(url)
    data = uh.read().decode()
    print('Retrieved', len(data), 'characters')
    
    try :
        js = json.loads(data)
    except :
        js = None
    
    if not js or 'status' not in js or js['status'] != 'OK' :
        print("==== Failure To Retrieve ====")
        print(data)
        continue
    
    lat = js["results"][0]["geometry"]["location"]["lat"]
    lng = js["results"][0]["geometry"]["location"]["lng"]
    print('lat', lat, 'lng', lng)
    location = js['results'][0]['formatted_address']
    print(location)
    break

Enter location: hwagok
Retrieving http://maps.googleapis.com/maps/api/geocode/json?address=hwagok
Retrieved 1514 characters
lat 37.54157800000001 lng 126.840436
Hwagok-dong, Seoul, South Korea


#### *Real JSON dat for SEOUL
![title](./JSON.png)

### 2) Securing API Requests

- The compute resources to run these APIs are not "free"
- The data provided by these APIs is usually valuable
- The data providers might limit the number of requests per day, demand an API 'key', or even charge for usage
- They might change the rules as things progress

Google geo provides 2500 free APIs per day.
TWITTER recommends authorization from the company !

In [130]:
# TWITTER
# and you need a KEY to get access

import urllib.request, urllib.parse, urllib.error
import twurl
import json

TWITTER_URL = "https://api.twitter.com/1.1/friends/list.json"

while True :
    print('')
    acct = input('Enter Twitter Account :')
    if (len(acct) < 1) : break
    url = twurl.augment(TWITTER_URL, 
                       {'screen_name' : acct, 'count' : '5'})
    print('Retrieving', url)
    connection = urllib.request.urlopen(url)
    data = connetion.read().decode()
    headers = dict(connection.getheaders())
    print("Remaining", headers['x-rate-limit-remaining'])
    js = json.loads(data)
    print(json.dumps(js, indent=4))
    
    for u in js['users'] :
        print(u['screen_name'])
        s = u['status']['text']
        print('\t', s[:50])

ModuleNotFoundError: No module named 'twurl'

# 4. REST
Representational State Transfer
IS a way of providing interoperability btw computer systems on the Internet.