# Chapter 9 - Getting Data

In order to be a data scientist you need data. In fact, as a data scientist you will spend
an embarrassingly large fraction of your time acquiring, cleaning, and transforming
data. In a pinch, you can always type the data in yourself (or if you have minions,
make them do it), but usually this is not a good use of your time. In this chapter, we’ll
look at different ways of getting data into Python and into the right formats.

## stdin and stdout

If you run your Python scripts at the command line, you can pipe data through them
using sys.stdin and sys.stdout. For example, here is a script that reads in lines of
text and spits back out the ones that match a regular expression:

In [1]:
import sys, re

# sys.argv is the list of command-line arguments
# sys.argv[0] is the name of the program itself
# sys.argv[1] will be the regex specified at the command line
regex = sys.argv[1]

# for every line passed into the script
for line in sys.stdin:
    # if it matches the regex, write it to stdout
    if re.search(regex,line):
        sys.stdout.write(line)

And here’s one that counts the lines it receives and then writes out the count:

In [2]:
import sys

count = 0 
for line in sys.stdin:
    count += 1
    
# print goes to sys.stdout
print(count)

0
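
To try the logic of these two scripts without a shell, it can be sketched as plain functions over any iterable of lines. The function names matching_lines and count_lines and the sample text are made up for illustration, and io.StringIO stands in for sys.stdin:

```python
import io
import re

def matching_lines(lines, regex):
    """Yield only the lines that contain a match for regex."""
    for line in lines:
        if re.search(regex, line):
            yield line

def count_lines(lines):
    """Count the items in any iterable of lines."""
    return sum(1 for _ in lines)

# io.StringIO stands in for sys.stdin (hypothetical sample input)
fake_stdin = io.StringIO("no digits here\nroom 101\nline 3\n")

# keep the lines that contain a digit, then count them
numbered = list(matching_lines(fake_stdin, r"[0-9]"))
digit_line_count = count_lines(numbered)
print(digit_line_count)
```

At the command line the same composition would be done by piping the output of the first script into the second.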


## Reading Files

You can also explicitly read from and write to files directly in your code. Python
makes working with files pretty simple.

### The Basics of Text Files

The first step to working with a text file is to obtain a file object using open:

    # 'r' means read-only
    file_for_reading = open('reading_file.txt', 'r')

    # 'w' is write -- will destroy the file if it already exists!
    file_for_writing = open('writing_file.txt', 'w')

    # 'a' is append -- for adding to the end of the file
    file_for_appending = open('appending_file.txt', 'a')

    # don't forget to close your files when you're done
    file_for_writing.close()
    
Because it is easy to forget to close your files, you should always use them in a with
block, at the end of which they will be closed automatically:

    with open(filename,'r') as f:
        data = function_that_gets_data_from(f)
    
    # at this point f has already been closed, so don't try to use it
    process(data)
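
A minimal runnable sketch of this pattern, assuming a throwaway temporary directory so that no real files are touched; the file name example.txt is hypothetical. The closed attribute of the file object confirms that the with block closed it:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")  # hypothetical file name

    # 'w' creates the file; it is closed automatically when the block ends
    with open(path, "w") as f:
        f.write("first line\nsecond line\n")

    with open(path, "r") as f:
        data = f.read()

    # f still exists as a name here, but the file behind it is closed
    closed_after_block = f.closed

print(closed_after_block)
print(data.splitlines())
```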
    
If you need to read a whole text file, you can just iterate over the lines of the file using
for:

    import re

    starts_with_hash = 0

    with open('input.txt','r') as f:
        for line in f:                      # look at each line in the file
            if re.match("^#",line):         # use a regex to see if it starts with '#'
                starts_with_hash += 1       # if it does, add 1 to the count
            
Every line you get this way ends in a newline character, so you’ll often want to
strip() it before doing anything with it.
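
A small sketch of what that looks like in practice, with io.StringIO standing in for an open text file and made-up contents:

```python
import io

# stands in for a file opened with open(..., 'r'); the contents are hypothetical
f = io.StringIO("# a comment\nplain text\n")

raw_lines = list(f)                                  # newlines still attached
clean_lines = [line.strip() for line in raw_lines]   # newlines removed

# strip() removes leading and trailing whitespace;
# rstrip("\n") removes only the trailing newline
kept_indent = "  indented\n".rstrip("\n")

print(raw_lines)
print(clean_lines)
```

Which of the two to use depends on whether leading whitespace in a line is meaningful for your data.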

For example, imagine you have a file full of email addresses, one per line, and that
you need to generate a histogram of the domains. The rules for correctly extracting
domains are somewhat subtle (e.g., the Public Suffix List), but a good first
approximation is to just take the parts of the email addresses that come after the @. (Which
gives the wrong answer for email addresses like joel@mail.datasciencester.com.)

    from collections import Counter

    def get_domain(email_address):
        """split on '@' and return the last piece"""
        return email_address.lower().split("@")[-1]
    
    with open('email_addresses.txt', 'r') as f:
        domain_counts = Counter(get_domain(line.strip()) for line in f if "@" in line)
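
To see the result without creating a file, here is the same computation run on a few made-up addresses, with io.StringIO standing in for email_addresses.txt. The addresses are hypothetical sample data:

```python
import io
from collections import Counter

def get_domain(email_address):
    """split on '@' and return the last piece"""
    return email_address.lower().split("@")[-1]

# hypothetical sample data standing in for email_addresses.txt
f = io.StringIO(
    "joel@datasciencester.com\n"
    "Ann@Example.com\n"
    "not an address\n"
    "bob@example.com\n"
)

# lines without an '@' are skipped; domains are lowercased before counting
domain_counts = Counter(get_domain(line.strip()) for line in f if "@" in line)
print(domain_counts.most_common())
```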

### Delimited Files

The hypothetical email addresses file we just processed had one address per line.
More frequently you’ll work with files with lots of data on each line. These files are
very often either comma-separated or tab-separated. Each line has several fields, with
a comma (or a tab) indicating where one field ends and the next field starts.

This starts to get complicated when you have fields with commas and tabs and newlines in them (which you inevitably do). For this reason, it’s pretty much always a mistake to try to parse them yourself. Instead, you should use Python’s csv module (or the pandas library). For technical reasons that you should feel free to blame on Microsoft, in Python 3 you should open csv files in text mode and pass newline='' to open, so that the csv module handles line endings itself. (The older advice to work in binary mode by including a b after the r or w applies only to Python 2.)
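
A short sketch of why hand-rolled parsing goes wrong, using a single made-up row whose first field contains a comma inside quotes:

```python
import csv
import io

# hypothetical row: the first field contains a comma inside quotes
line = '"Grus, Joel",Seattle,2014\n'

# naive approach: split on every comma, including the one inside the quotes
naive_fields = line.strip().split(",")

# csv.reader understands the quoting and keeps the field together
csv_fields = next(csv.reader(io.StringIO(line)))

print(naive_fields)
print(csv_fields)
```

The naive split produces four pieces with stray quote characters, while csv.reader returns the three intended fields.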

If your file has no headers (which means you probably want each row as a list, and
which places the burden on you to know what’s in each column), you can use
csv.reader to iterate over the rows, each of which will be an appropriately split list.

For example, if we had a tab-delimited file of stock prices:

    6/20/2014 AAPL 90.91
    6/20/2014 MSFT 41.68
    6/20/2014 FB 64.5
    6/19/2014 AAPL 91.86
    6/19/2014 MSFT 41.51
    6/19/2014 FB 64.34
    
we could process them with:

In [10]:
import csv

# Python 3: open in text mode with newline='' (binary mode was Python 2 advice)
with open('tab_delimited_stock_prices.txt', 'r', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        date = row[0]
        symbol = row[1]
        closing_price = float(row[2])
        process(date, symbol, closing_price)
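
Since process is left undefined above, here is one self-contained way to try the same loop. It assumes an in-memory string in place of tab_delimited_stock_prices.txt and, purely as an illustration, tracks the highest closing price per symbol:

```python
import csv
import io

# hypothetical in-memory stand-in for tab_delimited_stock_prices.txt
tab_delimited = (
    "6/20/2014\tAAPL\t90.91\n"
    "6/20/2014\tMSFT\t41.68\n"
    "6/19/2014\tAAPL\t91.86\n"
)

max_close = {}
reader = csv.reader(io.StringIO(tab_delimited), delimiter="\t")
for date, symbol, closing_price in reader:
    price = float(closing_price)  # every field arrives as a string
    max_close[symbol] = max(price, max_close.get(symbol, price))

print(max_close)
```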

## Using APIs

Many websites and web services provide application programming interfaces (APIs),
which allow you to explicitly request data in a structured format. This saves you the
trouble of having to scrape them!

### JSON (and XML)

Because HTTP is a protocol for transferring text, the data you request through a web
API needs to be serialized into a string format. Often this serialization uses JavaScript
Object Notation (JSON). JavaScript objects look quite similar to Python dicts, which
makes their string representations easy to interpret:

    { "title" : "Data Science Book",
      "author" : "Joel Grus",
      "publicationYear" : 2014,
      "topics" : [ "data", "science", "data science"] }
    
We can parse JSON using Python’s json module. In particular, we will use its loads
function, which deserializes a string representing a JSON object into a Python object:

In [18]:
import json

serialized = """{ "title" : "Data Science Book",
                  "author" : "Joel Grus",
                  "publicationYear" : 2014,
                  "topics" : [ "data", "science", "data science"] }"""

# Parse the JSON to create a Python dict.
deserialized = json.loads(serialized)
if "data science" in deserialized["topics"]:
    print(deserialized)

{'title': 'Data Science Book', 'author': 'Joel Grus', 'publicationYear': 2014, 'topics': ['data', 'science', 'data science']}
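
It can help to see how json.loads maps JSON types onto Python types, since values such as null and false come up often in API responses. A small sketch with made-up values:

```python
import json

# hypothetical JSON text covering the common value types
text = '{"name": "advent2019", "private": false, "language": null, "sizes": [1, 2.5]}'

parsed = json.loads(text)

# object -> dict, array -> list, string -> str,
# number -> int or float, false -> False, null -> None
print(parsed)
print(type(parsed["sizes"][0]).__name__, type(parsed["sizes"][1]).__name__)
```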


Sometimes an API provider hates you and only provides responses in XML:

    <Book>
        <Title>Data Science Book</Title>
        <Author>Joel Grus</Author>
        <PublicationYear>2014</PublicationYear>
        <Topics>
            <Topic>data</Topic>
            <Topic>science</Topic>
            <Topic>data science</Topic>
        </Topics>
    </Book>
    
You can use BeautifulSoup to get data from XML similarly to how we used it to get
data from HTML; check its documentation for details.
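
As an alternative sketch that needs no third-party packages, the same XML can be read with the standard library's xml.etree.ElementTree module. This is a swapped-in technique rather than the BeautifulSoup approach mentioned above, and the dict layout is just one possible choice:

```python
import xml.etree.ElementTree as ET

xml_text = """<Book>
    <Title>Data Science Book</Title>
    <Author>Joel Grus</Author>
    <PublicationYear>2014</PublicationYear>
    <Topics>
        <Topic>data</Topic>
        <Topic>science</Topic>
        <Topic>data science</Topic>
    </Topics>
</Book>"""

root = ET.fromstring(xml_text)  # root is the <Book> element

# element text is always a string, so numeric fields need converting
book = {
    "title": root.findtext("Title"),
    "author": root.findtext("Author"),
    "publicationYear": int(root.findtext("PublicationYear")),
    "topics": [topic.text for topic in root.findall("Topics/Topic")],
}
print(book)
```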

### Using an Unauthenticated API

Most APIs these days require you to first authenticate yourself in order to use them.
While we don’t begrudge them this policy, it creates a lot of extra boilerplate that
muddies up our exposition. Accordingly, we’ll first take a look at GitHub’s API, with
which you can do some simple things unauthenticated:

In [19]:
import requests, json

endpoint = "https://api.github.com/users/joelgrus/repos"

repos = json.loads(requests.get(endpoint).text)

In [20]:
repos

[{'id': 112873601,
  'node_id': 'MDEwOlJlcG9zaXRvcnkxMTI4NzM2MDE=',
  'name': 'advent2017',
  'full_name': 'joelgrus/advent2017',
  'private': False,
  'owner': {'login': 'joelgrus',
   'id': 1308313,
   'node_id': 'MDQ6VXNlcjEzMDgzMTM=',
   'avatar_url': 'https://avatars1.githubusercontent.com/u/1308313?v=4',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/joelgrus',
   'html_url': 'https://github.com/joelgrus',
   'followers_url': 'https://api.github.com/users/joelgrus/followers',
   'following_url': 'https://api.github.com/users/joelgrus/following{/other_user}',
   'gists_url': 'https://api.github.com/users/joelgrus/gists{/gist_id}',
   'starred_url': 'https://api.github.com/users/joelgrus/starred{/owner}{/repo}',
   'subscriptions_url': 'https://api.github.com/users/joelgrus/subscriptions',
   'organizations_url': 'https://api.github.com/users/joelgrus/orgs',
   'repos_url': 'https://api.github.com/users/joelgrus/repos',
   'events_url': 'https://api.github.com/users/

At this point repos is a list of Python dicts, each representing a public repository in
my GitHub account. (Feel free to substitute your username and get your GitHub
repository data instead. You do have a GitHub account, right?)

We can use this to figure out which months and days of the week I’m most likely to
create a repository. The only issue is that the dates in the response are strings:

    'created_at': '2013-07-05T02:02:28Z'
    
Python doesn’t come with a great date parser, so we’ll need to install one:

In [21]:
pip install python-dateutil



from which you’ll probably only ever need the dateutil.parser.parse function:

In [23]:
from dateutil.parser import parse
from collections import Counter

dates = [parse(repo["created_at"]) for repo in repos]
month_counts = Counter(date.month for date in dates)
weekday_counts = Counter(date.weekday() for date in dates)

In [25]:
month_counts

Counter({12: 4, 11: 5, 2: 3, 1: 3, 9: 4, 7: 4, 5: 3, 6: 1, 8: 2, 4: 1})

In [26]:
weekday_counts

Counter({5: 4, 4: 7, 6: 4, 1: 5, 2: 7, 3: 2, 0: 1})
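
The keys in weekday_counts are the integers returned by weekday(), where 0 is Monday and 6 is Sunday. A sketch that makes the counts easier to read, assuming a few made-up timestamps and using datetime.strptime with a fixed format in place of dateutil so that it runs with the standard library alone:

```python
from collections import Counter
from datetime import datetime

WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# hypothetical timestamps in the same format as the created_at field
created_at = ["2013-07-05T02:02:28Z", "2013-07-12T18:30:00Z", "2014-01-06T09:15:00Z"]

# strptime handles this one fixed format; the trailing Z is matched literally,
# so the resulting datetimes carry no timezone information
dates = [datetime.strptime(s, "%Y-%m-%dT%H:%M:%SZ") for s in created_at]

# weekday() returns 0 for Monday through 6 for Sunday
weekday_name_counts = Counter(WEEKDAY_NAMES[d.weekday()] for d in dates)
print(weekday_name_counts)
```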

Similarly, you can get the languages of my last five repositories:

In [27]:
last_5_repositories = sorted(repos, key=lambda r: r["created_at"], reverse=True)[:5]

In [28]:
last_5_repositories

[{'id': 225098708,
  'node_id': 'MDEwOlJlcG9zaXRvcnkyMjUwOTg3MDg=',
  'name': 'advent2019',
  'full_name': 'joelgrus/advent2019',
  'private': False,
  'owner': {'login': 'joelgrus',
   'id': 1308313,
   'node_id': 'MDQ6VXNlcjEzMDgzMTM=',
   'avatar_url': 'https://avatars1.githubusercontent.com/u/1308313?v=4',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/joelgrus',
   'html_url': 'https://github.com/joelgrus',
   'followers_url': 'https://api.github.com/users/joelgrus/followers',
   'following_url': 'https://api.github.com/users/joelgrus/following{/other_user}',
   'gists_url': 'https://api.github.com/users/joelgrus/gists{/gist_id}',
   'starred_url': 'https://api.github.com/users/joelgrus/starred{/owner}{/repo}',
   'subscriptions_url': 'https://api.github.com/users/joelgrus/subscriptions',
   'organizations_url': 'https://api.github.com/users/joelgrus/orgs',
   'repos_url': 'https://api.github.com/users/joelgrus/repos',
   'events_url': 'https://api.github.com/users/

In [29]:
last_5_languages = [repo["language"] for repo in last_5_repositories]

In [30]:
last_5_languages

['Python', None, 'Python', 'Python', 'HTML']
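
The None in that list is a repository for which the API reported no language (JSON null). A brief sketch of tallying the languages while skipping such entries, using the list printed above as input:

```python
from collections import Counter

# the list printed above
last_5_languages = ['Python', None, 'Python', 'Python', 'HTML']

# skip repositories whose language is missing
language_counts = Counter(lang for lang in last_5_languages if lang is not None)
print(language_counts.most_common())
```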

Typically we won’t be working with APIs at this low “make the requests and parse the
responses ourselves” level. One of the benefits of using Python is that someone has
already built a library for pretty much any API you’re interested in accessing. When
they’re done well, these libraries can save you a lot of the trouble of figuring out the
hairier details of API access. (When they’re not done well, or when it turns out they’re
based on defunct versions of the corresponding APIs, they can cause you enormous
headaches.)

Nonetheless, you’ll occasionally have to roll your own API-access library (or, more
likely, debug why someone else’s isn’t working), so it’s good to know some of the
details.
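
As a rough idea of what the smallest possible hand-rolled helper might look like, here is a sketch built on the standard library's urllib instead of requests. The name fetch_json, the injectable opener parameter, the User-Agent value, and the fake response are all assumptions made for illustration; the fake opener lets the example run without touching the network:

```python
import io
import json
from urllib.request import Request, urlopen

def fetch_json(url, opener=urlopen, timeout=10):
    """Request a URL and decode its JSON body.

    opener defaults to urllib's urlopen, which raises
    urllib.error.HTTPError for error status codes such as 404.
    """
    request = Request(url, headers={
        "Accept": "application/json",
        "User-Agent": "getting-data-sketch",  # hypothetical client name
    })
    with opener(request, timeout=timeout) as response:
        body = response.read().decode("utf-8")
    return json.loads(body)

def fake_opener(request, timeout):
    """Stand-in for urlopen that returns a canned, made-up response."""
    return io.BytesIO(b'[{"name": "example-repo", "language": "Python"}]')

example_repos = fetch_json("https://api.github.com/users/joelgrus/repos",
                           opener=fake_opener)
print(example_repos)
```

Dropping the opener argument makes the same call go to the real endpoint. A real wrapper would also have to deal with authentication, rate limits, and pagination, which is much of what the existing libraries do for you.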

## Finding APIs

If you need data from a specific site, look for a developers or API section of the site
for details, and try searching the Web for “python __ api” to find a library. There is a
Rotten Tomatoes API for Python. There are multiple Python wrappers for the Klout
API, for the Yelp API, for the IMDB API, and so on.

If you’re looking for lists of APIs that have Python wrappers, two directories are at
Python API and Python for Beginners.

If you want a directory of web APIs more broadly (without Python wrappers necessarily), a good resource is Programmable Web, which has a huge directory of categorized APIs.

And if after all that you can’t find what you need, there’s always scraping, the last refuge of the data scientist.