In [1]:
import sys
ipython = get_ipython()

# Print only the exception type and message instead of the full traceback
def exception_handler(exception_type, exception, traceback):
    print("%s: %s" % (exception_type.__name__, exception), file=sys.stderr)
ipython._showtraceback = exception_handler

## Collecting data from web-based sources

With those general caveats in mind, let's dive a bit more deeply into the specific case of gathering data from a web-based source, which is one of the more common forms of querying data.  It will also serve as an introduction to the type of Python coding that you'll do in this class.

The first step of collecting web-based data is to issue a request for this data via some protocol: HTTP (HyperText Transfer Protocol) or HTTPS (the secure version).  And while I know that one of the principles of this course is to teach you how things work "under the hood" as well as the common tools for doing so, we won't be concerned at all with the actual HTTP protocol or how these methods work in any detail; for our purposes, we're going to use the [requests](http://docs.python-requests.org/en/master/) library in Python.

Let's see how this works with some code.  The following code will load data from the Carnegie Mellon homepage:

In [2]:
import requests
response = requests.get("http://www.cmu.edu")

print("Status Code:", response.status_code)
print("Headers:", response.headers)

Status Code: 200
Headers: {'Date': 'Mon, 05 Aug 2019 17:35:47 GMT', 'Server': 'Apache', 'x-xss-protection': '1; mode=block', 'x-content-type-options': 'nosniff', 'x-frame-options': 'SAMEORIGIN', 'Vary': 'Referer', 'Accept-Ranges': 'bytes', 'Cache-Control': 'max-age=7200, must-revalidate', 'Expires': 'Mon, 05 Aug 2019 19:35:47 GMT', 'Keep-Alive': 'timeout=5, max=500', 'Connection': 'Keep-Alive', 'Transfer-Encoding': 'chunked', 'Content-Type': 'text/html'}


In [3]:
print(response.text[:480])

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="utf-8"/>
    <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
    <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
    <title>Homepage -     CMU - Carnegie Mellon University</title>    
    <meta content="CMU is a global research university known for its world-class, interdisciplinary programs: arts, business, computing, engineering, humanities, policy and science." name="description"/>
  


In [4]:
params = {"query": "python download url content", "source":"chrome"}
response = requests.get("http://www.google.com/search", params=params)
print(response.status_code)

200
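
The `params` argument in the call above is turned into a query string appended to the URL. As a rough sketch of what that looks like, the snippet below builds a similar URL by hand with the standard library's `urllib.parse` (chosen here only so the example runs without a network connection; the exact URL that requests produces may differ in small details):

```python
from urllib.parse import urlencode, urlsplit, parse_qs

params = {"query": "python download url content", "source": "chrome"}
base_url = "http://www.google.com/search"

# Encode the dictionary as key=value pairs joined by "&";
# urlencode escapes each space as "+"
url = base_url + "?" + urlencode(params)
print(url)

# Parsing the query string recovers the parameters (each value comes back as a list)
recovered = parse_qs(urlsplit(url).query)
print(recovered)
```

Passing a dictionary rather than pasting values into the URL yourself means you don't have to worry about escaping spaces and other special characters.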


Besides the HTTP GET command, there are other common HTTP commands (POST, PUT, DELETE), each of which can be called through the corresponding function in the library (`requests.post`, `requests.put`, `requests.delete`).

### RESTful APIs

In [5]:
# Get your own at https://github.com/settings/tokens/new
token = "3125e4430a58c5259a14ddd48157061cdb7055c0" 
response = requests.get("https://api.github.com/user", params={"access_token":token})

print(response.status_code)
print(response.headers["Content-Type"])
print(response.json().keys())

401
application/json; charset=utf-8
dict_keys(['message', 'documentation_url'])


### Authentication

Most APIs will use an authentication procedure that is more involved than the example above.  The standard here for a while was called "Basic Authentication", which you can use via the requests library by simply passing the login and password as the `auth` argument to the relevant calls, as below.

In [6]:
response = requests.get("https://api.github.com/user", auth=("zkolter", "github_password"))
print(response.status_code)

401
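
To make Basic Authentication a little less mysterious: the login and password are joined with a colon, base64-encoded, and sent in an `Authorization` header. The sketch below shows that encoding using only the standard library; the helper name `basic_auth_header` and the `"user"`/`"pass"` credentials are made up for illustration.

```python
import base64

def basic_auth_header(username, password):
    """Return the Authorization header value used by HTTP Basic Authentication."""
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")

# Hypothetical credentials, for illustration only
header = basic_auth_header("user", "pass")
print(header)

# base64 is an encoding, not encryption: anyone can reverse it
decoded = base64.b64decode(header.split(" ", 1)[1]).decode("utf-8")
print(decoded)
```

Because the encoding is trivially reversible, Basic Authentication should only be sent over HTTPS.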


## Common data formats and handling

Now that you've obtained some data (either by requesting it from a web source, or just getting a file sent to you), you'll need to know how to handle that data format.  Obviously, data comes in many different formats, but some of the more common ones that you'll deal with as a data scientist are:

- CSV (comma separated value) files
- JSON (JavaScript Object Notation) files and strings
- HTML/XML (hypertext markup language / extensible markup language) files and strings


### CSV files

The "CSV" name is really a misnomer: CSV doesn't only refer to comma separated values, but really refers to any delimited text file (for instance, fields could be delimited by spaces or tabs, or any other character, specific to the file).  For example, let's take a look at the following data file describing weather data at Pittsburgh airport:

In [8]:
import pandas as pd
# Raw string (r"...") so the backslashes in the Windows path are not treated as escape sequences
dataframe = pd.read_csv(r"H:\SELF\Yashu\Courses\Data Science\L2 Data Collection and Scrapping\data_collection\data_collection\kpit_weather.csv", delimiter=",", quotechar='"')
dataframe.head()

Unnamed: 0,ZTime,Time,OAT,DT,SLP,WD,WS,SKY,PPT,PPT6,Plsr.Event,Plsr.Source
0,20170820040000,20170820000000,178,172,10171,0,0,0,0,-9999,,
1,20170820050000,20170820010000,178,172,10177,0,0,0,0,-9999,,
2,20170820060000,20170820020000,167,161,10181,0,0,0,0,-9999,,
3,20170820070000,20170820030000,161,161,10182,0,0,4,0,-9999,,
4,20170820080000,20170820040000,156,156,10186,180,15,-9999,0,-9999,,
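
To illustrate the point that the delimiter need not be a comma, here is a small sketch that reads tab-delimited text. It uses the standard library's `csv` module instead of pandas (just to keep the example dependency-free), and the input string is a few columns from the rows shown above, re-delimited with tabs for illustration.

```python
import csv
import io

# Three columns from the weather rows above, separated by tabs instead of commas
text = (
    "Time\tOAT\tDT\n"
    "20170820000000\t178\t172\n"
    "20170820010000\t178\t172\n"
)

# DictReader uses the first line as the header; delimiter selects the separator
reader = csv.DictReader(io.StringIO(text), delimiter="\t")
rows = list(reader)

# Note that the csv module returns every field as a string
print(rows[0])
```

The `delimiter` argument passed to `pd.read_csv` above plays the same role in pandas.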


### JSON data

Although originally built as a data format specific to the JavaScript language, JSON (JavaScript Object Notation) is another extremely common way to share data.  We've already seen it in the GitHub API example above, but very briefly, JSON allows for storing a few different data types:

- Numbers: e.g. `1.0`, either integers or floating point (JavaScript treats every number as floating point, while Python's `json` module parses `1` as an `int` and `1.0` as a `float`)
- Booleans: `true` or `false`, plus a separate `null` value
- Strings: `"string"` characters enclosed in double quotes (the `"` character then needs to be escaped as `\"`)
- Arrays (lists): `[item1, item2, item3]` list of items, where item is any of the described data types
- Objects (dictionaries): `{"key1":item1, "key2":item2}`, where the keys are strings and item is again any data type

Note that lists and dictionaries can be nested within each other, so that, for instance

    {"key1":[1.0, 2.0, {"key2":"test"}], "key3":false}

would be a valid JSON object.
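
As a quick check, the snippet below parses that exact string with Python's built-in `json` module and shows how the JSON types map onto Python ones (objects become dictionaries, arrays become lists, `false` becomes `False`). The variable names are just for illustration.

```python
import json

text = '{"key1":[1.0, 2.0, {"key2":"test"}], "key3":false}'
obj = json.loads(text)

# The outer object is a dict, the array is a list, and false maps to False
print(type(obj).__name__, type(obj["key1"]).__name__, obj["key3"])

# Nested values are reached by chaining dictionary keys and list indices
nested = obj["key1"][2]["key2"]
print(nested)
```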

Let's look at the full JSON returned by the GitHub API above:

In [9]:
print(response.content)

b'{"message":"Bad credentials","documentation_url":"https://developer.github.com/v3"}'


In [10]:
import json
print(json.loads(response.content))

{'message': 'Bad credentials', 'documentation_url': 'https://developer.github.com/v3'}


In [11]:
data = {"a":[1,2,3,{"b":2.1}], 'c':4}
json.dumps(data)

'{"a": [1, 2, 3, {"b": 2.1}], "c": 4}'

In [12]:
json.dumps(response)

TypeError: Object of type 'Response' is not JSON serializable
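
The error above arises because `json.dumps` only knows how to serialize plain Python types (dictionaries, lists, strings, numbers, booleans, and `None`). One way around it, sketched below, is to copy the fields you care about into a dictionary first. The `FakeResponse` class is a hypothetical stand-in for a real response object so that the example runs without a network call.

```python
import json

class FakeResponse:
    """Hypothetical stand-in for an object that json.dumps cannot serialize."""
    def __init__(self, status_code, text):
        self.status_code = status_code
        self.text = text

resp = FakeResponse(401, '{"message": "Bad credentials"}')

# Serializing the object directly fails, as in the cell above
try:
    json.dumps(resp)
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)

# Copy the fields of interest into plain Python types, then serialize
summary = {"status_code": resp.status_code, "body": json.loads(resp.text)}
encoded = json.dumps(summary)
print(encoded)
```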


### XML/HTML

Last, another format you will likely encounter is XML/HTML documents, though in my assessment XML seems to be losing out to JSON as a generic format for APIs and data files, at least for cases where JSON will suffice, mainly because JSON is substantially easier to parse.  XML files contain hierarchical content delineated by tags, like the following:

In [13]:
from bs4 import BeautifulSoup

root = BeautifulSoup("""
<tag attribute="value">
    <subtag>
        Some content for the subtag
    </subtag>
    <openclosetag attribute="value2"/>
    <subtag>
        Second one
    </subtag>
</tag>
""", "lxml-xml")

print(root, "\n")
print(root.tag.subtag, "\n")
print(root.tag.openclosetag.attrs)

<?xml version="1.0" encoding="utf-8"?>
<tag attribute="value">
<subtag>
        Some content for the subtag
    </subtag>
<openclosetag attribute="value2"/>
<subtag>
        Second one
    </subtag>
</tag> 

<subtag>
        Some content for the subtag
    </subtag> 

{'attribute': 'value2'}


In [14]:
print(root.tag.find_all("subtag"))

[<subtag>
        Some content for the subtag
    </subtag>, <subtag>
        Second one
    </subtag>]


The nice thing about the `find_all` function is that you can call it at a higher level in the tree, and it will recurse down through the whole document.  So we could just as easily have done: