# Getting Data - Part 3


Information comes from Ch. 9 of Data Science from Scratch, 2nd Edition, by Joel Grus. This book is available for free through the library's connection to O'Reilly's learning platform.

## What have we learned so far?

We have looked at **increasing levels of abstraction**. 

**Read and write to stdin and stdout**  

```
import sys

# for every line read in from stdin
for line in sys.stdin:
    sys.stdout.write(line)
```

**Read and write from/to a file**

```
f = open('testfile.txt', 'r')    # open for reading
f.close()

fw = open('testfilewrite.txt', 'w')    # open for writing

fw.close()
```

**Read and write from/to a file**

```
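f = open('testfile.txt', 'r')

lines = f.readlines()    # list of all lines

f.seek(0)                # rewind: readlines() leaves the position at the end
text = f.read()          # the whole file as one string

f.seek(0)
for line in f:           # iterate line by line
    print(line)

f.close()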
```

**Read and write from/to delimited files**

```
import csv

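f = open('tab_file.txt', 'r', newline='')    # text mode; the csv module expects newline=''
reader = csv.reader(f, delimiter='\t')
for row in reader:
    print(row)    # each row is a list of strings

f = open('colon_file.txt', 'r', newline='')
reader = csv.DictReader(f, delimiter=':')    # each row is a dict keyed by the header row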
```

**Read and write from/to delimited files**

```
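import csv
import time

stock, price = 'ACME', 542.23    # placeholder values for illustration

f = open('data/comma_test.txt', 'w', newline='')    # text mode, not 'wb', in Python 3

writer = csv.writer(f, delimiter=',')

writer.writerow([time.strftime("%m/%d/%Y"), stock, price])

f.close()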
```

**Read and write data with `pandas`**

```
read_csv

read_table

read_fwf

read_clipboard

to_csv('filename')
```

**Web scraping**   
Get data using `requests` and parse it using `BeautifulSoup` and some regular expressions / string manipulation.

```
from bs4 import BeautifulSoup
import requests 

url = "http://example.com"
html = requests.get(url).text 
soup = BeautifulSoup(html, 'html5lib')
```

### Today, you will learn about: 

* The JSON format, how to parse it 
* Use of APIs to get data from websites and services 

Many websites and web services provide application programming interfaces (APIs), which allow you to explicitly request data in a structured format. 

*When APIs are available, they should be used instead of scraping information.*


In [None]:
import requests
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl 
%matplotlib inline  

## JSON

Because HTTP is a protocol for transferring text, the data you request through a web API needs to be serialized into a string format. Often this serialization uses **JavaScript Object Notation (JSON)**. 

- It is:
    - very easy for computers to read and write.
    - moderately easy for programmers to read and write by hand.
    - meant to be generated and parsed.

- Most modern languages have an interface for working with JSON objects.
    - JSON objects _resemble_ Python dictionaries (but are not the same!).



### JSON data types

| Type | Description |
| --- | --- |
| String | Anything inside double quotes. |
| Number | Any number (no difference between ints and floats). |
| Boolean | `true` and `false`. |
| Null | JSON's empty value, denoted by `null`. |
| Array | Like Python lists. |
| Object | A collection of key-value pairs, like dictionaries. Keys must be strings, values can be anything (even other objects). |

See [json-schema.org](https://json-schema.org/understanding-json-schema/reference/type.html) for more details.
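
As a small sketch of how the types in the table map onto Python, the snippet below parses one made-up value of each JSON type and reports the Python type that `json.loads` produces. The sample string and key names are invented for illustration. Note that although JSON has a single number type, Python's parser returns an `int` or a `float` depending on how the number is written.

```python
import json

# One JSON value of each type from the table (sample data, made up for illustration).
sample = '{"s": "text", "i": 3, "f": 2.5, "b": true, "n": null, "a": [1, 2], "o": {"k": "v"}}'
parsed = json.loads(sample)

# Map each key to the name of the Python type it was parsed into.
python_types = {key: type(value).__name__ for key, value in parsed.items()}
print(python_types)
# {'s': 'str', 'i': 'int', 'f': 'float', 'b': 'bool', 'n': 'NoneType', 'a': 'list', 'o': 'dict'}
```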

### Example JSON object

See `data/family.json`.

<center><img src='imgs/hierarchy.png' width=50%></center>

In [None]:
import json
from pathlib import Path

f = Path('data') / 'family.json'
family_tree = json.loads(f.read_text())
family_tree

In [None]:
# We can access the nested information 
family_tree['children']

In [None]:
...

In [None]:
...

### Using the `json` module

Let's process the same file using the `json` module. Note:
- `json.load(f)` parses JSON from a file object `f`.
- `json.loads(s)` parses JSON from a **s**tring `s`.
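
The difference between the two functions can be sketched as follows. To keep the example self-contained, `io.StringIO` stands in for an open file, and the JSON text is made up for illustration.

```python
import io
import json

text = '{"name": "ACME", "shares": 100}'

# json.loads parses a JSON document held in a string.
from_string = json.loads(text)

# json.load reads from a file-like object; io.StringIO stands in for an open file here.
file_like = io.StringIO(text)
from_file = json.load(file_like)

print(from_string == from_file)  # True
```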


#### Handling _unfamiliar_ data

- Never trust data from an unfamiliar site.

- **Never** use `eval` on "raw" data that you didn't create!

- The JSON data format needs to be **parsed**, not evaluated as a dictionary.
    - It was designed with safety in mind!
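
To illustrate why parsing is safer than evaluating, here is a minimal sketch with a made-up "payload": a string that is a valid Python expression but not valid JSON. `eval` would execute it, whereas `json.loads` rejects it without running anything.

```python
import json

# A string that is a valid Python expression but not valid JSON (hypothetical payload).
untrusted = '__import__("os").getcwd()'

try:
    json.loads(untrusted)
    parsed_ok = True
except json.JSONDecodeError as err:
    # The parser refuses the input without executing any of it.
    parsed_ok = False
    print("Rejected:", err.msg)
```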

### Example: loads( )
We can parse JSON using Python’s `json` module. In particular, we will use its `loads` function, which deserializes a string representing a JSON object into a Python object.

In [None]:
serialized = """{ "title" : "Data Science Book",
                  "author" : "Joel Grus",
                  "publicationYear" : 2014,
                  "topics" : [ "data", "science", "data science"] }"""

# parse the JSON to create a Python dict
deserialized = json.loads(serialized)
deserialized

We can then use the deserialized object like a Python dict object to find information.  For example, does the book cover the topic of "data science":

In [None]:
type(serialized)

In [None]:
type(deserialized)

In [None]:
if "data science" in deserialized["topics"]:
    print("Yes")
else: 
    print("No")

In [None]:
deserialized.keys()

### Example: dumps( )

The `dumps` function takes a Python object (e.g., a dict) and serializes it into a JSON-formatted string. 

In [None]:
data = {
   'name' : 'ACME',
   'shares' : 100,
   'price' : 542.23
}
json_obj = json.dumps(data)
json_obj

In [None]:
type(data)

In [None]:
type(json_obj)

In [None]:
data2 = json.loads(json_obj)
data2

In [None]:
type(data2)

In [None]:
help(json.dumps)

### Example  
*Ref: https://pythonspot.com/en/json-encoding-and-decoding-with-python/*

In [None]:
# Convert JSON to Python Object, then iterate
array = '{"drinks": ["coffee", "tea", "water"]}'
data = json.loads(array)
 
for element in data['drinks']:
    print(element)

In [None]:
json_data = '{"name": "Brian", "city": "Seattle"}'
python_obj = json.loads(json_data)
print(python_obj["name"])
print(python_obj["city"])

In [None]:
obj = {
    "persons": [
        {
            "city": "Seattle", 
            "name": "Brian"
        }, 
        {
            "city": "Amsterdam", 
            "name": "David"
        }
    ]
}
obj

### Hierarchical Data  
*Example from Data100*

A lot of structured data isn't in CSV format, but in HTML, XML, JSON, YAML, etc. JSON might have a structure that Pandas can't read directly.

Here's an example: a group of people collected information about US congressional legislators in YAML format.

https://github.com/unitedstates/congress-legislators

Here's one of the data files:

https://github.com/unitedstates/congress-legislators/blob/master/legislators-current.yaml

YAML is a data serialization language commonly used for configuration. For more information, see https://en.wikipedia.org/wiki/YAML

In [None]:
import requests
from pathlib import Path

legislators_path = 'legislators-current.yaml'
base_url = 'https://theunitedstates.io/congress-legislators/'

def download(url, path):
    """Download the contents of a URL to a local file."""
    path = Path(path) # If path was a string, now it's a Path
    if not path.exists():
        print('Downloading...', end=' ')
        resp = requests.get(url)
        with path.open('wb') as f:
            f.write(resp.content)
        print('Done!')
        
download(base_url + legislators_path, legislators_path)

The code above downloads the YAML file containing information on current legislators and stores it locally (skipping the download if the file already exists).  

Then we can just open the local file to look at the information.  

Note, we can also see the file in the files directory and look at it. 

In [None]:
import yaml

legislators = yaml.load(open(legislators_path), Loader=yaml.SafeLoader)
len(legislators)

In [None]:
type(legislators)

Let's create a function that converts a legislator's birthday string into a `datetime` object. 

In [None]:
from datetime import datetime

def to_date(s):
    return datetime.strptime(s, '%Y-%m-%d')

#to_date('2020-10-06')
x = legislators[0]  # the first legislator record
to_date(x['bio']['birthday'])

We can create a data frame consisting of the legislator's id, first name, last name and birthday. 

In [None]:
leg_df = pd.DataFrame(
    columns=['id', 'first', 'last', 'birthday'],
    data=[[l['id']['bioguide'], 
           l['name']['first'],
           l['name']['last'],
           to_date(l['bio']['birthday'])] for l in legislators])
leg_df.head()

In [None]:
leg_df.dtypes

We could also add their age. 

In [None]:
datetime.now() - leg_df.loc[0, 'birthday']
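
The subtraction above yields a time difference measured in days. One possible way to turn a birthday into an age in whole years is sketched below using only the standard library. The helper name `age_in_years` and the sample dates are hypothetical, chosen for illustration.

```python
from datetime import date

def age_in_years(birthday, today):
    """Return the number of completed years between birthday and today (both datetime.date)."""
    years = today.year - birthday.year
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (birthday.month, birthday.day):
        years -= 1
    return years

# Hypothetical birthday and reference date, for illustration only.
print(age_in_years(date(1950, 6, 15), date(2020, 10, 6)))  # 70
```

Assuming the `birthday` column holds timestamps, something like `leg_df['birthday'].apply(lambda b: age_in_years(b.date(), date.today()))` should produce an age column.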

### Aside:  Lambda functions

A lambda function is a small anonymous function that can take any number of arguments, but can only have one expression. 

It has the following syntax: 

`lambda` *arguments* : *expression*

The expression is executed and the result returned.

#### Example   
A lambda function that adds 10 to the number passed in as an argument.

In [None]:
# Example 
x = lambda a: a+10
print(x(5))

#### Example   
A lambda function that takes two inputs and multiplies them together.

In [None]:
# Example 
x = lambda a, b : a*b
print(x(5,6))

#### Example  
Apply the lambda function to an argument by surrounding the function and argument in parentheses:

In [None]:
(lambda x: x + 1)(2)

#### Example  
Lambda functions are often used with other functions and methods in Python such as `apply`, `filter`, `map`, `sorted`, etc.

In [None]:
ids = ['id1', 'id2', 'id30', 'id3', 'id22', 'id100']
print(sorted(ids)) # Lexicographic sort

In [None]:
sorted_ids = sorted(ids, key=lambda x: int(x[2:])) # Integer sort
print(sorted_ids)

#### Example   
Here is an example using the `map` function, which expects a function object and one or more iterables (a list, tuple, etc.). `map` applies the function to each element and returns an iterator over the results. In Python 3 it does not return a list, which is why `print(x)` below shows a map object rather than the values.

In [None]:
def multiply2(x): 
    return x * 2 

x = map(multiply2, [1, 2, 3, 4])
print(x)

In [None]:
def print_iterator(it):
    for x in it:
        print(x, end=' ')
    print('')

print_iterator(x)

In [None]:
mp_it = map(lambda x : x * 2, [1, 2, 3, 4])
print_iterator(mp_it)

In [None]:
list_numbers = [1, 2, 3, 4]
tuple_numbers = (5, 6, 7, 8)
map_iterator = map(lambda x, y: x * y, list_numbers, tuple_numbers)
print_iterator(map_iterator)
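
`filter`, mentioned earlier alongside `map` and `sorted`, follows the same pattern. A minimal sketch with made-up numbers:

```python
numbers = [1, 2, 3, 4, 5, 6]

# filter keeps the elements for which the function returns a truthy value.
evens = filter(lambda n: n % 2 == 0, numbers)

# Like map, filter returns a lazy iterator, so convert it to a list to see the values.
even_list = list(evens)
print(even_list)  # [2, 4, 6]
```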

## APIs 

An application programming interface (API) is a service that makes data directly available to the user in a convenient fashion.

Most APIs these days require you to first authenticate yourself in order to use them. That adds a lot of boilerplate that would muddy the exposition, so the examples below use APIs that accept unauthenticated requests.

Advantages:

- The data are usually clean, up-to-date, and ready to use.

- The presence of an API signals that the data provider is okay with you using their data.

- The data provider can plan and regulate data usage.
    - Some APIs require you to create an API "key", which is like an account for using the API.
    - APIs can also give you access to data that isn't publicly available on a webpage.
    
<br>

Big disadvantage: APIs don't always exist for the data you want!


### API terminology

- A URL, or uniform resource locator, describes the location of a website or resource.

- An **API endpoint** is a URL of the data source that the user wants to make requests to.

- For example, on the [Reddit API](https://www.reddit.com/dev/api/):
    - the `/comments` endpoint retrieves information about comments.
    - the `/hot` endpoint retrieves data about posts labeled "hot" right now.
    - To access these endpoints, you add the endpoint name to the base URL of the API.
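
Adding an endpoint name to a base URL can be sketched as below. The helper `build_endpoint_url` is hypothetical, and the base URL and endpoint are the Open Notify ones used later in this section.

```python
def build_endpoint_url(base_url, endpoint):
    """Join a base URL and an endpoint name with exactly one slash between them."""
    return base_url.rstrip('/') + '/' + endpoint.lstrip('/')

base = 'http://api.open-notify.org'
print(build_endpoint_url(base, '/iss-now.json'))
# http://api.open-notify.org/iss-now.json
```

Stripping the slashes on both sides means the helper behaves the same whether or not the base URL ends with `/` or the endpoint starts with one.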

### API requests

- API requests are just `GET`/`POST` requests to a specially maintained URL.
- Let's test out the [Pokémon API](https://pokeapi.co).

First, let's make a `GET` request for `'squirtle'`. To do this, we need to make a request to the correct URL.

In [None]:
def create_url(pokemon):
    return f'https://pokeapi.co/api/v2/pokemon/{pokemon}'

create_url('squirtle')

In [None]:
r = requests.get(create_url('squirtle'))
r

Remember, the 200 status code is good! Let's take a look at the **content**:

Looks like JSON. We can extract the JSON from this request with the `json` method (or by passing `r.text` to `json.loads`).

In [None]:
rr = r.json()
rr

Let's try a `GET` request for `billy`

In [None]:
r = requests.get(create_url('billy'))
r

We receive a 404 error, since there is no Pokémon named `'billy'`!
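
Since a failed request does not carry the JSON we expect, it is worth checking the status code before parsing. A minimal sketch of that check follows. The helper `parse_if_ok` is hypothetical and takes plain values so it can run without a network connection.

```python
import json

def parse_if_ok(status_code, body_text):
    """Return the parsed JSON body for a 200 response, otherwise None."""
    if status_code != 200:
        return None
    return json.loads(body_text)

# Stand-in values; with requests these would be r.status_code and r.text.
print(parse_if_ok(200, '{"name": "squirtle"}'))  # {'name': 'squirtle'}
print(parse_if_ok(404, 'Not Found'))             # None
```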

### A Simple API example 

Let's try using the [OpenNotify API](http://open-notify.org/) that serves NASA data. 

Let's use a `GET` request to see what data comes back. `requests.get` takes a URL, in our case the URL of Open Notify. Let's make a request and print what is returned. When we request a URL without a specific endpoint, we get HTML content in response. **Endpoints** are the locations of the resources.

In [None]:
request = requests.get('http://api.open-notify.org')
print(request.text)

In [None]:
print(request.status_code)

Let's try to request something from the API for an end point that doesn't exist. 

In [None]:
request2 = requests.get('http://api.open-notify.org/fake-endpoint')
print(request2.status_code)

We get a 404 Error.  

Now let's try a real endpoint. For example, we can get the current location of the International Space Station with the endpoint `/iss-now.json`. Alternatively, `/iss-pass.json` returns the time at which the space station passes overhead. 

In [None]:
issLoc = requests.get('http://api.open-notify.org/iss-now.json')
print(issLoc.text)

obj = json.loads(issLoc.text)

print(obj['timestamp'])
print(obj['iss_position']['latitude'], obj['iss_position']['longitude'])

The API description tells us how to make the request and what the expected output will be. 

In this case the data returned has the following format: 

```
{
  "message": "success", 
  "timestamp": UNIX_TIME_STAMP, 
  "iss_position": {
    "latitude": CURRENT_LATITUDE, 
    "longitude": CURRENT_LONGITUDE
  }
}
```

Let's now convert the `UNIX_TIME_STAMP` to a time that is readable using the `datetime` module. 


In [None]:
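from datetime import timezone

# datetime.utcfromtimestamp is deprecated as of Python 3.12; build a UTC-aware datetime instead
print(datetime.fromtimestamp(obj['timestamp'], tz=timezone.utc).strftime('%Y-%m-%d %H:%M:%S'))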

Another endpoint available gets information on the people in space. 

In [None]:
people = requests.get('http://api.open-notify.org/astros.json')
print(people.text)

In [None]:
people_json  = people.json()
print(people_json)

We can use the optional arguments of `json.dumps()` to *pretty print* the JSON array elements and object members.

In [None]:
people_json_obj = json.dumps(people.json(), indent=2)
print(people_json_obj)

In [None]:
#To print the number of people in space
print("Number of people in space:",people_json['number'])
#To print the names of people in space using a for loop
for p in people_json['people']:
    print(p['name'])

### GitHub API example

We’ll take a look at [GitHub’s API](https://developer.github.com/v3/), with which you can do some simple things unauthenticated. 

Here we can look at all the repositories for user joelgrus, the author of `Data Science from Scratch`.

The API documentation specifies the form of the query: 
`GET /users/:username/repos`

https://docs.github.com/en/rest/repos/repos?apiVersion=2022-11-28#list-repositories-for-a-user

In [None]:
resp = requests.get('https://api.github.com/users/joelgrus/repos')
resp

In [None]:
repos = json.loads(resp.text)
repos

To pretty print the information we again make use of the `dumps()` function with the indent argument. 

In [None]:
repos_obj = json.dumps(repos, indent=2) # alternatively call json.dumps(resp.json(), indent=2)
print(repos_obj)

In [None]:
len(repos)

At this point `repos` is a list of Python `dicts`, each representing a public repository in Joel Grus's GitHub account. (Feel free to substitute your username and get your GitHub repository data instead.)


Let's look at the languages in the 5 most recently created repositories. 
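
One way to do this is sketched below on a small stand-in list, so the example runs without a network call. The repository names, dates, and languages are made up. `created_at` and `language` are fields in GitHub's repository objects, and `language` can be null. Because the timestamps are ISO 8601 strings in a uniform format, sorting them as strings also sorts them chronologically.

```python
# Hypothetical stand-in for the `repos` list; real entries have many more fields.
sample_repos = [
    {"name": "repo-a", "created_at": "2019-01-10T08:00:00Z", "language": "Python"},
    {"name": "repo-b", "created_at": "2021-05-02T12:30:00Z", "language": "Haskell"},
    {"name": "repo-c", "created_at": "2018-03-15T09:45:00Z", "language": None},
    {"name": "repo-d", "created_at": "2022-11-20T17:10:00Z", "language": "Python"},
    {"name": "repo-e", "created_at": "2020-07-04T21:05:00Z", "language": "JavaScript"},
    {"name": "repo-f", "created_at": "2017-09-30T06:20:00Z", "language": "Go"},
]

# Sort newest first, using a lambda as the sort key.
newest_first = sorted(sample_repos, key=lambda repo: repo["created_at"], reverse=True)

# Keep the language of the five most recently created repositories.
last_5_languages = [repo["language"] for repo in newest_first[:5]]
print(last_5_languages)
# ['Python', 'Haskell', 'JavaScript', 'Python', None]
```

Replacing `sample_repos` with the `repos` list from the request above should give the languages for the real account.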

We could also specify certain parameters, e.g., whether the user is an owner or a member of the repository. 

In [None]:
resp = requests.get('https://api.github.com/users/joelgrus/repos?type=member')
repos2 = json.loads(resp.text)
print(len(repos2))
repos2

If we look at the url: 
https://api.github.com/users/joelgrus/repos?type=member

Here `repos` is the endpoint. The `?` symbol starts the query string, which applies constraints or specifies parameters as `key=value` pairs. 
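
To make the structure of such a URL concrete, here is a small sketch that builds the query string by hand with the standard library. The `per_page` parameter is added only to show how several parameters are joined.

```python
from urllib.parse import urlencode

base = 'https://api.github.com/users/joelgrus/repos'
query = {"type": "member", "per_page": 5}

# urlencode turns the dict into 'key=value' pairs joined by '&'.
url = base + '?' + urlencode(query)
print(url)
# https://api.github.com/users/joelgrus/repos?type=member&per_page=5
```

In practice `requests` can build the query string for you from a dictionary of parameters.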

We could do the same with the following code. 

Look at the parameter options in the documentation: 
https://docs.github.com/en/rest/repos/repos?apiVersion=2022-11-28#list-repositories-for-a-user

In [None]:
params = {"type": "member"}
resp = requests.get('https://api.github.com/users/joelgrus/repos', params=params)
repos2 = json.loads(resp.text)
repos2

#### Other endpoints 

Many other pieces of information are available through other endpoints. 

For instance, we can retrieve the metadata for a single repository. [GitHub API documentation](https://docs.github.com/en/free-pro-team@latest/rest/reference/repos#get-a-repository)





In [None]:
resp = requests.get('https://api.github.com/repos/octocat/hello-world')
repos3 = json.loads(resp.text)
repos3

### Another example with GitHub API

Here we are looking at the issues for a given repository: 
https://docs.github.com/en/free-pro-team@latest/rest/reference/issues#list-repository-issues

We are also making use of the `pandas` `read_json` command to directly read the JSON output from the API into a dataframe. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html

In [None]:
df = pd.read_json('https://api.github.com/repos/pydata/pandas/issues?per_page=5')

In [None]:
df[['created_at', 'title', 'body', 'comments']]

In [None]:
res = df[['created_at', 'title', 'body', 'comments']].head()
res.to_json()

In [None]:
df.keys()

#### Finding APIs

If you need data from a specific site, look for a developers or API section of the site for details, and try searching the Web for “python __ api” to find a library.

If you’re looking for lists of APIs that have Python wrappers, two directories are at [Python for Beginners](http://bit.ly/1L35VOR).
