# Data Analytics

## Python and Web Data Gathering

What we'll look at...
- Python and working with the Internet
    -   Using HTTP with `urllib`
- Data on the Web
    -   Using XML
    -   Using JSON
- BeautifulSoup and Web Scraping



---

### Python and working with the Internet

Communications have come a looong way since someone first had the notion of sending information from point "A" to point "B". 

#### Brief History of long distance communications ...

- **The Telegraph** -- Early constructions of the telegraph started popping up across the world during the 19th century.

- **The Telephone** -- Patented in 1876 by Alexander Graham Bell, the telephone spread as switching systems were built to connect any two parties, and later benefited from the transistor and electronic switching. The telephone combined the natural back-and-forth of a conversation with the immediacy of the telegraph.

- **The Radio** -- Radio waves, also known as Hertzian waves, were first discovered in the late 19th century and were later used for commercial purposes by Guglielmo Marconi in 1896. This led to the development of all sorts of wave technology communications, including microwave signals.

- **Computer Networking** -- In October of 1969, the first data traveled between nodes of the ARPANET, a predecessor of the Internet and one of the first packet-switched computer networks. That first message was sent by Charley Kline at UCLA and received by Bill Duvall at the Stanford Research Institute.

- **The Internet** -- On January 1, 1983, the Internet was officially born. ARPANET switched from its old `Network Control Program` (NCP) to the new `Transmission Control Protocol/Internet Protocol` (TCP/IP), which became the standard. Today the Internet allows instant communication between parties on opposite sides of the planet, using any number of different devices.


#### Some basic theory...

**TCP Connections / Sockets**
-   The `Transmission Control Protocol (TCP)` is built on top of the `Internet Protocol (IP)`
-   TCP handles “flow control” using a transmit window and provides a nice reliable pipe
-   A `port` is an application-specific or process-specific software communications endpoint
-   Ports allow multiple networked applications to coexist on the same server
-   There is a list of well-known TCP port numbers, e.g. HTTP is on port 80
-   Sometimes we see the port number in the URL if the web server is running on a “non-standard” port, e.g. `www.lasi-asia.org:8080/wp/`
-   Python has built-in support for TCP Sockets using the `socket` library
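
To make the idea of a port in a URL concrete, here is a minimal sketch using the standard library's `urllib.parse.urlsplit`. The helper name `host_and_port` and the small `DEFAULT_PORTS` table are illustrative assumptions, and an `http://` scheme has been added in front of the example address above because `urlsplit` needs one to recognise the host.

```python
from urllib.parse import urlsplit

# Default ports for two common schemes (a small illustrative table,
# not a complete registry of well-known ports).
DEFAULT_PORTS = {"http": 80, "https": 443}

def host_and_port(url):
    """Return (hostname, port), falling back to the scheme's default port."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return parts.hostname, port

# A URL with an explicit, non-standard port
print(host_and_port("http://www.lasi-asia.org:8080/wp/"))

# A URL with no port: the scheme's default is assumed
print(host_and_port("http://data.pr4e.org/romeo.txt"))
```

When no port is written in the URL, `parts.port` is `None`, which is why the sketch falls back to the scheme's usual port.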

<br>

![Network Communication Layers](images/NetworkCmmunicationLayers.jpg)

<br>

**HTTP - Hypertext Transfer Protocol**
-   HTTP is the set of rules to allow browsers to retrieve web documents from servers over the Internet
-   HTTP is the dominant Application Layer Protocol on the Internet
-   It was invented for the Web - to retrieve HTML, images, documents, etc.
-   HTTP has been extended to retrieve data in addition to documents - RSS, Web Services, etc.
-   The basic concept of HTTP - Make a Connection - Request a document - Retrieve the Document - Close the Connection
-   When users click on a link (an anchor tag with an `href` attribute), the browser makes a connection to the web server and issues a “GET” request - to GET the content of that page at the specified URL
-   The server returns the HTML document to the browser, which formats and displays the document to the user
-   Python's `socket` module works one level below HTTP: it opens the TCP connection, and the program then sends the HTTP request text and reads the response itself
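
The connect / request / retrieve / close cycle described above can be spelled out by hand with `socket`. This is a rough sketch rather than a full HTTP client: the function names `build_get_request`, `fetch` and `split_response` are made up for illustration, and only the request-building step runs when the block is executed, so no network connection is needed.

```python
import socket

def build_get_request(host, path):
    """Build the bytes of a minimal HTTP/1.0 GET request."""
    request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
    return request.encode("utf-8")

def fetch(host, path, port=80):
    """Connect, send the request, read the whole response, then close."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(build_get_request(host, path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:      # an empty read means the server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)

def split_response(raw):
    """Separate the header block from the body at the first blank line."""
    headers, _, body = raw.partition(b"\r\n\r\n")
    return headers.decode("iso-8859-1"), body

# Show the request text without touching the network
print(build_get_request("data.pr4e.org", "/romeo.txt").decode())
```

With a network connection, `headers, body = split_response(fetch("data.pr4e.org", "/romeo.txt"))` would perform the full cycle, and `body.decode()` would give the text of the document.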

<br>

**About Characters and Strings…**
-   To represent the wide range of characters and character sets computers must be able to handle, we represent characters with more than one byte of data
-   `UTF-8` is recommended practice for encoding data to be exchanged between systems. `UTF-8` uses 1-4 bytes of data to represent each character of data
-   Inside Python 3, all strings (`str`) are Unicode
-   In Python, working with string variables in programs, and reading data from files, usually "just works" because of the power of built-in methods and functions
-   When we talk to a network resource using sockets, or talk to a database, we have to `encode` and `decode` data (usually to UTF-8)
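
A small sketch of the `encode` / `decode` round trip mentioned above. The sample text is arbitrary and is written with escape sequences so that the block does not depend on how the file itself is encoded.

```python
# "café €5" followed by a smiley, written with escapes
text = "caf\u00e9 \u20ac5 \U0001F600"

# str -> bytes: needed before sending text over a socket
data = text.encode("utf-8")
print(len(text), "characters ->", len(data), "bytes")

# Each character takes between 1 and 4 bytes in UTF-8
for ch in ["c", "\u00e9", "\u20ac", "\U0001F600"]:
    print(repr(ch), "uses", len(ch.encode("utf-8")), "byte(s)")

# bytes -> str: needed after receiving data from the network
round_trip = data.decode("utf-8")
print(round_trip == text)
```

The character count and the byte count differ because the accented letter, the euro sign and the emoji each need more than one byte.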

<br>

**Making HTTP Easier With `urllib`**
-   Since HTTP is so common, we have a Python library that does all the socket work for us and makes web pages look like a file

-   To use it ... `import urllib.request, urllib.parse, urllib.error`

-   Then create a file handle ... `fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')`

In [None]:
# import & read text from a Web Page using HTTP
import urllib.request, urllib.parse, urllib.error

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in fhand:
    print(line.decode().strip())    # strip leading and trailing blanks
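
Because the handle returned by `urlopen()` can be looped over like a file of byte strings, ordinary file-processing code carries over. As a sketch, the hypothetical helper `count_words` below counts words in any iterable of byte lines; a short stand-in list is used so the block runs without a network connection.

```python
from collections import Counter

def count_words(lines):
    """Count words in an iterable of byte strings, such as a urlopen() handle."""
    counts = Counter()
    for line in lines:
        counts.update(line.decode().split())
    return counts

# Stand-in data so the sketch runs offline; with a connection, pass
# urllib.request.urlopen('http://data.pr4e.org/romeo.txt') instead.
sample = [b"the quick brown fox\n", b"jumps over the lazy dog\n"]

counts = count_words(sample)
print(counts.most_common(3))
```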

#### Read all about `urllib` in the documentation at ... [HOW TO Fetch Internet Resources Using The urllib Package](https://docs.python.org/3/howto/urllib2.html)

---

### Data on the Web

- With HTTP Request/Response established, there was a natural move toward exchanging data between programs using this protocol
- Two formats are commonly used for sending data across the “Net”, a.k.a. “wire protocols” or “wire formats” - what we send on the “wire”
- The two common formats are: `XML` and `JSON`

#### The XML Format

Agreeing on a “Wire Format” led to `XML - eXtensible Markup Language`

It started as a simplified subset of the Standard Generalized Markup Language (SGML), and is designed to be a relatively human-legible way for information systems to share structured data.

XML Terminology
-   **Tags** - indicate the beginning and ending of elements
-   **Attributes** - Keyword/value pairs on the opening tag of XML
-   **Serialize / De-Serialize** - Convert data in one program into a common format that can be stored and/or transmitted between systems in a programming language-independent manner, and convert it back into program data on the receiving side
-   **XML Schema** - Describing a “contract” as to what is the acceptable legal format of an XML document
-   **Schema Languages** - Many XML Schema Languages; the most common "standard" is XML Schema from W3C - **XSD** or **“W3C Schema”**
-   **“W3C Schema”** - defines a structure, constraints, data types, and syntax for XML transfers

In [None]:
# basic example of using XML data in Python
import xml.etree.ElementTree as ET

# named xml_data rather than input, so Python's built-in input() is not hidden
xml_data = '''
    <stuff>
        <users>
            <user x="2">
                <id>001</id>
                <name>Chuck</name>
            </user>
            <user x="7">
                <id>009</id>
                <name>Brent</name>
            </user>
        </users>
    </stuff>'''

stuff = ET.fromstring(xml_data)
lst = stuff.findall('users/user')
print('User count:', len(lst))

for item in lst:
    print('Name', item.find('name').text)
    print('Id', item.find('id').text)
    print('Attribute', item.get("x"))
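
The example above only de-serializes XML. As a sketch of the opposite direction, the block below builds the same kind of tree from a Python list with `ET.Element` and `ET.SubElement`, serializes it with `ET.tostring`, and parses it back. The `users` list simply mirrors the sample data used above.

```python
import xml.etree.ElementTree as ET

users = [
    {"x": "2", "id": "001", "name": "Chuck"},
    {"x": "7", "id": "009", "name": "Brent"},
]

# Build the tree: stuff -> users -> user (attribute x) -> id, name
root = ET.Element("stuff")
users_el = ET.SubElement(root, "users")
for user in users:
    user_el = ET.SubElement(users_el, "user", {"x": user["x"]})
    ET.SubElement(user_el, "id").text = user["id"]
    ET.SubElement(user_el, "name").text = user["name"]

# Serialize: turn the in-memory tree into text that can be stored or sent
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)

# De-serialize: parse the text back and confirm nothing was lost
parsed = ET.fromstring(xml_text)
print("User count:", len(parsed.findall("users/user")))
```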

In [None]:
# Complex example of pulling XML from the internet
import xml.etree.ElementTree as ET
import urllib.request, urllib.parse, urllib.error
import datetime as dt

# pull a list of winning PowerBall numbers since 2010
url = 'https://data.ny.gov/api/views/d6yy-54nr/rows.xml?accessType=DOWNLOAD'
response = urllib.request.urlopen(url).read()

# parse the XML response to build a tree
tree = ET.fromstring(response)
listLookup = tree.findall('row/row')
print('Record count:', len(listLookup))

for item in listLookup:
    drawDate = item.find('draw_date').text
    winningNumber = item.find('winning_numbers').text

    # reformat the date to a prettier format
    isoDate = dt.datetime.fromisoformat(drawDate)
    formatDate = isoDate.strftime("%A %d. %B %Y")

    print('\nDrawing Date: - ', formatDate)
    print('Winning Numbers: - ', winningNumber)

#### Read all about the Python xml.etree library documentation at ... [The ElementTree XML](https://docs.python.org/3/library/xml.etree.elementtree.html)


---

### Using the JSON Protocol

There is a huge amount of data available on the web and much of it is in JSON (JavaScript Object Notation).

JSON is a lightweight data-interchange format which can be easily read and written by humans, and easily parsed and generated by machines. It is a completely language-independent text format.

Douglas Crockford “discovered” JSON. JSON represents data as nested arrays and objects, which map to nested “lists” and “dictionaries” in Python.

The syntax of JSON is considered as a subset of the syntax of JavaScript including the following:

-   **Name/value pairs**: -- Represent data; the name is followed by a `:` (colon) and the value, and the pairs are separated by commas
-   **Curly braces**: -- Hold objects
-   **Square brackets**: -- Hold arrays with values separated by commas

The **name (key)** must be a string in double quotes, and the **value** must be one of the following data types: string, number, object (JSON object), array, Boolean, or null
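
To see how each of these value types arrives in Python, here is a minimal sketch using `json.loads()`. The field names and values in `document` are made up for illustration.

```python
import json

# One value of each JSON type (hypothetical sample data)
document = '''
{
    "name": "Chuck",
    "age": 42,
    "score": 9.5,
    "active": true,
    "nickname": null,
    "tags": ["python", "web"],
    "address": {"city": "Ann Arbor"}
}'''

record = json.loads(document)

# Print each key with the Python type its value was converted to
for key, value in record.items():
    print(f"{key:10} {type(value).__name__:10} {value!r}")
```

Strings become `str`, numbers become `int` or `float`, `true`/`false` become `True`/`False`, `null` becomes `None`, arrays become lists and objects become dictionaries.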
 
Python's standard library does the work of fetching and parsing JSON data from the web for us.

The `urllib.request` module opens the URL and reads the raw data, and the `json` module parses that data.

In this way, one can read a JSON response from a given URL by using the `urlopen()` function to get the response, and then `json.loads()` to convert the response into Python lists and dictionaries.

Here are the steps of the process by which we can read the JSON response from a link or URL in Python:
-   Import required modules
-   Assign URL
-   Get the response of the URL using `urlopen()`
-   Parse it into Python objects using `json.loads()`
-   Display the generated JSON response

In [None]:
# example of JSON data with Python
import json

# named data rather than input, so Python's built-in input() is not hidden
data = '''
[
    { "id" : "001",
        "x" : "2",
        "name" : "Chuck"
    } ,
    { "id" : "009",
        "x" : "7",
        "name" : "Brent"
    }
]'''

info = json.loads(data)
print('User count:', len(info))

for item in info:
    print("\n")
    print('Name', item['name'])
    print('Id', item['id'])
    print('Attribute', item['x'])

In [None]:
# Here is an example of pulling JSON from the web
# import urllib library and json
from urllib.request import urlopen
import json

# store the URL to import in ourURL as parameter for urlopen
ourURL = "https://jsonplaceholder.typicode.com/users"

# store the response of URL
response = urlopen(ourURL)

# parse the JSON response from the URL into Python lists and dictionaries
data = json.loads(response.read())

# print the parsed data as Python shows it
print("\nUnformatted JSON data ...\n", data)

# serialize the data back into an indented, human-readable JSON string
pretty_json = json.dumps(data, indent=4)

# print the formatted JSON string
print("\nFormatted JSON string ...\n", pretty_json)
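
Once the response is parsed, the data is reached with ordinary list and dictionary indexing. The sketch below uses a trimmed, hand-written sample in the general shape of a `/users` record so that it runs offline. The field names are assumptions to check against the live response, and the values are invented.

```python
import json

# Hand-written sample (assumed field names, invented values)
sample = '''
[
    {"id": 1, "name": "Ada Example", "email": "ada@example.com",
     "address": {"city": "Springfield", "geo": {"lat": "10.0", "lng": "20.0"}}},
    {"id": 2, "name": "Bob Example", "email": "bob@example.com",
     "address": {"city": "Shelbyville", "geo": {"lat": "-5.5", "lng": "30.25"}}}
]'''

users = json.loads(sample)

for user in users:
    city = user["address"]["city"]                # dictionary inside a dictionary
    lat = float(user["address"]["geo"]["lat"])    # numbers may arrive as strings
    print(f'{user["id"]}: {user["name"]} <{user["email"]}> lives in {city} (lat {lat})')

# .get() avoids a KeyError when an optional key is missing
print(users[0].get("phone", "no phone listed"))
```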

JSONPlaceholder comes with a set of 6 common resources. Here are their URLs:

-   /posts -- 100 posts -- https://jsonplaceholder.typicode.com/posts
-   /comments -- 500 comments -- https://jsonplaceholder.typicode.com/comments
-   /albums -- 100 albums -- https://jsonplaceholder.typicode.com/albums
-   /photos -- 5000 photos -- https://jsonplaceholder.typicode.com/photos
-   /todos -- 200 todos -- https://jsonplaceholder.typicode.com/todos
-   /users -- 10 users -- https://jsonplaceholder.typicode.com/users

You are welcome to try these out for yourself to see what comes back in these JSON data loads.


#### Read all about Python JSON in this article ... [Working With JSON Data in Python](https://realpython.com/python-json/)


---

### BeautifulSoup and Web Scraping

The incredible amount of data on the Internet is a rich resource for any field of research or personal interest.

To effectively harvest that data, you’ll need to become skilled at web scraping.

#### What is Web Scraping

Web scraping is when a program or script pretends to be a browser: it retrieves web pages, looks at those pages, extracts information, and then looks at more web pages.

Search engines scrape web pages - we call this “spidering the web” or “web crawling”.

> Note : Web scraping can violate a website's terms of service and may be illegal in some cases. It may also cause your IP address to be blocked by the website.

Some websites don’t like it when automatic scrapers gather their data, while others don’t mind. 

If you’re scraping a page respectfully for educational purposes, then you’re unlikely to have any problems. Still, it’s a good idea to do some research and make sure that you’re not violating any 'Terms of Service' before you start.
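
Besides the Terms of Service, many sites publish a `robots.txt` file that says which paths automated clients may visit. This goes slightly beyond what is covered above, so treat it as an optional sketch: it uses the standard library's `urllib.robotparser` with a made-up `robots.txt` supplied inline so that it runs offline.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, supplied inline so the sketch runs offline
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# Ask whether a generic client ("*") may fetch each URL
for url in ["http://example.com/quotes/", "http://example.com/private/data.html"]:
    verdict = "allowed" if parser.can_fetch("*", url) else "disallowed"
    print(url, "->", verdict)
```

For a real site, `parser.set_url(...)` followed by `parser.read()` would download that site's own `robots.txt` instead of using the inline lines.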

#### Scraping Web Pages

The Python libraries `requests`, `html5lib` and `BeautifulSoup` are powerful tools, perfect for the job of web scraping.

Python `requests` is a module that allows you to send HTTP requests using Python. The HTTP request returns a 'Response Object' with all the response data (content, encoding, status, etc).

Once we have accessed the HTML content in a 'Response Object', we need to parse the data. One needs a parser which can create a nested/tree structure of the HTML data. There are many HTML parser libraries available; `html5lib` parses a page the same way a web browser does, which makes it very tolerant of messy HTML.

Python `BeautifulSoup` is a library (from https://www.crummy.com/software/BeautifulSoup/) for pulling data out of `HTML` and `XML` files. This document covers `BeautifulSoup` version 4 that works with Python 3.

#### Installing `BeautifulSoup`, `html5lib` and `requests`

To install these libraries, use `pip`.

```bash
python -m pip install requests
python -m pip install html5lib
python -m pip install beautifulsoup4
```

#### Steps involved in web scraping:

1. Send an HTTP request to the URL of the webpage you want to access. The server responds to the request by returning the HTML content of the webpage. For this task, we will use `requests`

2. Once we have accessed the HTML content, we are left with the task of parsing the data. Since most of the HTML data is nested, we cannot extract data simply through string processing. One needs a parser which can create a nested/tree structure of the HTML data. For this task, we will use `html5lib`

3. Now, we need to navigate and search the parsed tree that we created, i.e. tree traversal. For this task, we will use `BeautifulSoup`



#### **Step 1.) - Accessing the HTML content from a webpage**

Import the `requests` library. Then, specify the URL of the webpage you want to scrape.
Send an HTTP request to the specified URL and save the response from the server in a response object called `response`.
Now, print `response.content` to get the raw HTML content of the webpage. It is of `bytes` type; `response.text` gives the same content decoded to a string.

In [None]:
# import the libraries
import requests

# request the URL of the webpage you want to access
URL = "http://www.dr-chuck.com/page1.htm"
response = requests.get(URL)

# print 'response.content' to get the raw HTML of the webpage as bytes ('response.text' gives a str)
print(response.content)


#### **Step 2.) - Parsing the HTML content**

A really nice thing about the BeautifulSoup library is that it is built on top of HTML parsing libraries like html5lib, lxml, html.parser, etc. So creating the BeautifulSoup object and specifying the parser library can be done at the same time.

We create a BeautifulSoup object by passing two arguments:
- **response.content** : It is the raw HTML content.
- **html5lib** : Specifying the HTML parser we want to use.

Printing `soup.prettify()` gives the visual representation of the parse tree created from the raw HTML content. 


In [None]:
# import the libraries
import requests
from bs4 import BeautifulSoup

# request the URL of the webpage you want to access
URL = "http://www.values.com/inspirational-quotes"
response = requests.get(URL)

# parse the response into a tree and print a readable view of it
soup = BeautifulSoup(response.content, 'html5lib')
print(soup.prettify())

#### **Step 3.) - Searching and navigating through the parse tree**

Now, we would like to extract some useful data from the HTML content. The soup object contains all the data in the nested structure which could be programmatically extracted. 

In our example, we are scraping a webpage consisting of some quotes. So, we would like to create a program to save those quotes (and all relevant information about them). 

Here below is code to do that ...

In [None]:
# Program to scrape a website and save its quotes to a CSV file
import csv
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# request the URL of the webpage you want to access
URL = "http://www.values.com/inspirational-quotes"
response = requests.get(URL)

# parse the response into a tree
soup = BeautifulSoup(response.content, 'html5lib')
# print(soup.prettify())

quotes = []  # a list to store quotes

# search the tree for the HTML container that holds the quotes
table = soup.find('div', attrs={'id': 'all_quotes'})
# print(table.prettify())

# iterate over the divs in the container to find each quote's info;
# find() returns None if the page layout has changed, so fall back to an empty list
rows = table.find_all('div') if table is not None else []
for row in rows:
    # skip wrapper divs that do not hold a complete quote
    if row.a is None or row.img is None or row.h5 is None or row.h5.a is None:
        continue
    quote = {}  # create a dictionary for each quote
    quote['url'] = urljoin(URL, row.a['href'])  # turn a relative link into a full URL
    quote['lines'] = row.img['alt'].split(" #")[0]
    quote['theme'] = row.h5.a.text
    quote['img'] = row.img['src']
    quotes.append(quote)    # attach each quote to the list

print(quotes)

# save the quotes list of dictionaries into a CSV file
filename = 'inspirational_quotes.csv'
with open(filename, 'w', newline='', encoding='utf-8') as f:
    w = csv.DictWriter(f, ['theme', 'url', 'img', 'lines'])
    w.writeheader()
    for quote in quotes:
        w.writerow(quote)

##### **Let's analyze the above code**

- First search through the HTML content of the webpage; print it using the `soup.prettify()` method and try to find a pattern or a way to navigate to the quotes.

- The quotes are inside a `div` container whose `id` is ‘all_quotes’. So, we find that div element by using the `find()` method:

            table = soup.find('div', attrs = {'id':'all_quotes'}) 

- The first argument is the name of the HTML tag we want (`div`); the second argument is a dictionary that specifies additional attributes the tag must have.

- The `find()` method returns the first matching element, or `None` if nothing matches. We can try to print `table.prettify()` to get a sense of what this piece of code does.

- Now, in the table element, one can notice that each quote is inside a div container whose class is quote. So, we iterate through the div containers inside `table`.

- To collect them we use the `find_all()` method (older code spells it `findAll()`), which takes the same arguments as `find()` but returns a list of all matching elements. Each quote is now iterated using a variable called `row`.

- Using the row variable, we find info snippets on each quote to populate a quote dictionary, which is added to the quotes list.

- Finally, save the quotes list of dictionaries into a CSV file.
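
The difference between `find()` and `find_all()` described above can be tried offline on a small piece of HTML. The markup below is hand-written and only imitates the kind of structure discussed; the real page may differ. The sketch uses the `html.parser` backend that ships with Python, so only `beautifulsoup4` needs to be installed.

```python
from bs4 import BeautifulSoup

# Hand-written HTML standing in for a scraped page (hypothetical structure)
html = """
<div id="all_quotes">
  <div class="quote"><a href="/quotes/1"><img alt="Be kind #kindness" src="/img/1.jpg"></a>
    <h5><a href="/themes/kindness">Kindness</a></h5></div>
  <div class="quote"><a href="/quotes/2"><img alt="Keep going #grit" src="/img/2.jpg"></a>
    <h5><a href="/themes/grit">Grit</a></h5></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

table = soup.find("div", attrs={"id": "all_quotes"})      # first match only
rows = table.find_all("div", attrs={"class": "quote"})    # list of all matches
print("Rows found:", len(rows))

for row in rows:
    # row.a, row.img and row.h5 each reach the first such tag inside the row
    print(row.h5.a.text, "|", row.img["alt"].split(" #")[0], "|", row.a["href"])
```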

>Read all the documentation on using the BeautifulSoup library in the [Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

### A Note on Scraping Dynamic Websites

In this section we learned how to scrape a static website. 

Static sites are straightforward to work with because the server sends you an HTML page that already contains all the page information in the response. You can parse that HTML response and immediately begin to pick out the relevant data.

On the other hand, with a dynamic website, the server might not send back any HTML at all. Instead, you could receive JavaScript code as a response. This code will look completely different from what you saw when you inspected the page with your browser’s developer tools.

Many modern web applications are designed to provide their functionality in collaboration with the clients’ browsers. Instead of sending HTML pages, these apps send JavaScript code that instructs your browser to create the desired HTML. 

Web apps deliver dynamic content in this way to offload work from the server to the clients’ machines as well as to avoid page reloads and improve the overall user experience.

When we use `requests`, we only receive what the server sends back. In the case of a dynamic website, you’ll end up with some JavaScript code instead of HTML. 

The only way to go from the JavaScript code you received to the content that you’re interested in is to execute the code, just like your browser does. The `requests` library can’t do that for you, but there are other solutions that can.

For example, `requests-html` is a project created by the author of the `requests` library that allows you to render JavaScript using syntax that’s similar to the syntax in requests. It also includes capabilities for parsing the data by using Beautiful Soup under the hood.

### Closing note

In today’s highly-connected and instantaneous world, we have access to a massive amount of information at our fingertips.

Interconnected media was a huge step forward in that it enabled everyone to be a part of the conversation. On the other hand, algorithms and the sheer amount of content to sift through have created a lot of downsides as well.

Between 2015 and 2025, the amount of data captured, created, and replicated globally is projected to increase by 1,600%.

Read this article on [The Evolution of Media: Visualizing a Data-Driven Future](https://www.visualcapitalist.com/evolution-of-media-data-future/) and specifically look at the info-graphic that goes with the article.
