## Chapter 9. Getting Data

### stdin, stdout
 
If you run Python scripts at the command line, you can **pipe** (`|`) data through them using `sys.stdin` and `sys.stdout`.
Ex: one script that reads in lines of text + spits back out the ones that match a RegEx, and another that counts the lines it receives + writes out that count:

In [1]:
# egrep.py
import sys, re

# sys.argv = list of CLI args
# sys.argv[0] = name of the program itself
# sys.argv[1] = RegEx specified at the CL
regex = sys.argv[1]

# for every line passed into the script
for line in sys.stdin:
    # if matches RegEx, write to stdout
    if re.search(regex,line):
        sys.stdout.write(line)
        
# line_count.py
import sys

count = 0
for line in sys.stdin:
    count += 1
# print goes to stdout
print(count)

0


Can use these to count how many lines in a file contain a number (any digit):

In [2]:
# windows
#!type someFile.txt | python egrep.py "[0-9]" | python line_count.py

# linux
#!cat someFile.txt | python egrep.py "[0-9]" | python line_count.py
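
The same filter-then-count pipeline can be tried without a shell. A minimal sketch, assuming the two scripts are rewritten as functions (`egrep` and `line_count` here are illustrative names) and using `io.StringIO` with made-up text as a stand-in for `sys.stdin`:

```python
import io
import re

def egrep(lines, pattern):
    """Yield only the lines that match the regular expression."""
    for line in lines:
        if re.search(pattern, line):
            yield line

def line_count(lines):
    """Count the lines in any iterable of lines."""
    return sum(1 for _ in lines)

# io.StringIO stands in for sys.stdin; the sample text is made up
fake_stdin = io.StringIO("no digits here\nroom 101\nstill none\n42 is the answer\n")

# equivalent of: cat someFile.txt | python egrep.py "[0-9]" | python line_count.py
digit_line_count = line_count(egrep(fake_stdin, "[0-9]"))
print(digit_line_count)
```

Because `egrep` is a generator, lines flow through one at a time, much like they do through a real pipe.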

In [3]:
# script that counts words in its input + writes out most common ones:
# most_common_words.py
import sys
from collections import Counter

# pass in number of words as 1st arg
try:
    num_words = int(sys.argv[1])
except (IndexError, ValueError):  # argument missing or not an integer
    print("usage: most_common_words.py num_words")
    sys.exit(1) # non-zero exit code = indicates error

counter = Counter(word.lower()
                 for line in sys.stdin
                 for word in line.strip().split() # split on space 
                 if word) # skip empty words

for word,count in counter.most_common(num_words):
    sys.stdout.write(str(count))
    sys.stdout.write("\t")
    sys.stdout.write(word)
    sys.stdout.write("\n")

### Ex of above script
## type the_bible.txt | python most_common_words.py 10
# 64193 the
# 51380 and
# 34753 of
# 13643 to
# 12799 that
# 12560 in
# 10263 he
# 9840 shall
# 8987 unto
# 8836 for

usage: most_common_words.py num_words


SystemExit: 1

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)
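
The `SystemExit` above appears because a notebook does not pass `num_words` on the command line. A minimal sketch of the same counting logic as a function (`most_common_words` as a function name is an assumption, and the sample text is made up), which can be called directly:

```python
import io
from collections import Counter

def most_common_words(lines, num_words):
    """Return the num_words most frequent lowercased words as (word, count) pairs."""
    counter = Counter(word.lower()
                      for line in lines
                      for word in line.split())  # split() never yields empty words
    return counter.most_common(num_words)

# io.StringIO stands in for sys.stdin; the sample text is made up
sample = io.StringIO("The cat and the dog\nand THE bird\n")
top_two = most_common_words(sample, 2)

for word, count in top_two:
    print(count, word, sep="\t")
```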


### Reading Files

Can also explicitly read from + write to files directly in Python code.

### Basics of Text Files

1st step to working with a text file = **obtain a *file object* via `open`**

In [4]:
# 'r' = read only
file_for_reading = open("reading_file.txt", "r")

# 'w' = write = ***DESTROYS the file if it already exists***
file_for_writing = open("writing_file.txt", "w")

# 'a' = append (to end of file)
file_for_appending = open("appending_file.txt", "a")

# close files when done
file_for_writing.close()

FileNotFoundError: [Errno 2] No such file or directory: 'reading_file.txt'

It's easy to forget to close files, so always use them in a **`with` block**, at the end of which they will be closed automatically:

In [5]:
with open(filename,'r') as f:
    data=function_to_get_data_from(f)
    
# at this point, file has been closed, don't try to use it
process(data)

NameError: name 'filename' is not defined
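
A runnable sketch of the three modes and the `with` block, assuming a scratch file in a temporary directory (the file name `example.txt` is hypothetical):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")  # hypothetical file name

    with open(path, "w") as f:       # 'w' creates (or truncates) the file
        f.write("first line\n")
    with open(path, "a") as f:       # 'a' adds to the end
        f.write("second line\n")
    with open(path, "r") as f:       # 'r' reads
        appended = f.read()
    closed_after_with = f.closed     # the with block closed the file for us

    with open(path, "w") as f:       # 'w' again destroys the old contents
        f.write("replaced\n")
    with open(path, "r") as f:
        overwritten = f.read()

print(repr(appended), closed_after_with, repr(overwritten))
```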

To read a whole text file, iterate over its lines with `for`:

In [6]:
starts_with_hash = 0

with open("input.txt", "r") as f:
    for line in f:
        if re.match("^#",line): # check if each line starts with # and count if True
            starts_with_hash += 1

FileNotFoundError: [Errno 2] No such file or directory: 'input.txt'

Every line you get this way ends in a **newline** character, so you’ll often want to `strip()` it before doing anything with it.
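
A quick sketch of what the trailing newline looks like and what `strip()` does to it, using `io.StringIO` as a stand-in for a real file (the sample lines are made up):

```python
import io
import re

# io.StringIO behaves like an open text file
sample_file = io.StringIO("# a comment\nvalue = 1\n# another comment\n")

raw_lines = []
starts_with_hash = 0
for line in sample_file:
    raw_lines.append(line)               # each line still ends in "\n"
    if re.match("^#", line.strip()):     # strip() removes surrounding whitespace
        starts_with_hash += 1

stripped_lines = [line.strip() for line in raw_lines]
print(repr(raw_lines[0]), repr(stripped_lines[0]), starts_with_hash)
```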

Ex: You have a file full of email addresses, 1 per line, and you need to generate a histogram of the domains. The rules for correctly extracting domains are somewhat subtle (e.g., the Public Suffix List), but a good 1st approximation = just take the part of each email address that comes after the @. (This gives the wrong answer for addresses at subdomains, like "@mail.datasciencester.com".)

In [7]:
def get_domain(email_address):
    """Split on '@' and return the last piece"""
    return email_address.lower().split("@")[-1]

with open("email_address.txt", "r") as f:
    domain_counts = Counter(get_domain(line.strip())
                           for line in f
                           if "@" in line)

FileNotFoundError: [Errno 2] No such file or directory: 'email_address.txt'
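
Since `email_address.txt` is not available here, a sketch with a made-up list of lines shows what the histogram would look like (all addresses below are invented for illustration):

```python
from collections import Counter

def get_domain(email_address):
    """Split on '@' and return the last piece"""
    return email_address.lower().split("@")[-1]

# made-up lines standing in for the contents of the file
lines = ["joel@example.com\n",
         "Ann@Example.com\n",
         "bob@mail.example.org\n",
         "not an address\n"]

domain_counts = Counter(get_domain(line.strip())
                        for line in lines
                        if "@" in line)   # skip lines without an address
print(domain_counts)
```

Note that the subdomain address is counted under `mail.example.org`, which is the limitation of the 1st approximation.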

### Delimited Files

More often we work w/ files w/ lots of data on each line that're very often either comma or tab-separated. Each line has several fields, w/ a comma/tab indicating where 1 field ends + the next starts.

This gets complicated when you have fields w/ commas + tabs + newlines in them. For this reason, it’s pretty much always a mistake to try to parse them yourself. Instead, use Python’s `csv` module (or `pandas` library).

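The advice to always open CSV files in **binary mode** (a `b` after the `r` or `w`) applies to Python 2. In Python 3, the `csv` module expects files opened in text mode, ideally with `newline=""` so that newlines embedded in quoted fields are handled correctly (see the `csv` module docs).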

If your file has no headers (which means you probably want each row as a list, + which places the burden on you to know what’s in each column), you can use `csv.reader` to iterate over rows, each of which will be an appropriately split list.

Ex: TSV of stock prices:

In [8]:
import csv

def process(date, symbol, price):
    print(date, symbol, price)

with open("tab_delimited_stock_prices.txt", "r") as f:
    reader = csv.reader(f, delimiter="\t")
    
    for row in reader:
        date = row[0]
        symbol = row[1]
        closing_price = float(row[2])
        process(date,symbol,closing_price)

6/20/2014 AAPL 90.91
6/20/2014 MSFT 41.68
6/20/2014 FB 64.5
6/19/2014 AAPL 91.86
6/19/2014 MSFT 41.51
6/19/2014 FB 64.34


If the file has headers, we can either skip the header row (with an initial call to `next(reader)`) or get each row as a `dict` (with headers as keys) by using `csv.DictReader`:

In [9]:
import csv

def process(date, symbol, price):
    print(date, symbol, price)

with open("colon_delimited_stock_prices.txt", "r") as f:
    reader = csv.DictReader(f, delimiter=":")
    
    for row in reader:
        date = row["date"]
        symbol = row["symbol"]
        closing_price = float(row["closing_price"])
        process(date,symbol,closing_price)

6/20/2014 AAPL 90.91
6/20/2014 MSFT 41.68
6/20/2014 FB 64.5


Even if the file doesn’t have headers we can still use `DictReader` by passing it the keys as a `fieldnames` parameter, + we can similarly write out delimited data using `csv.writer`:

In [10]:
todays_prices = {"AAPL":90.91,"MSFT":41.68,"FB":64.5}

with open("comma_delimited_stock_prices.txt", "w") as f:
    writer = csv.writer(f, delimiter=",")
    
    for stock,price in todays_prices.items():
        writer.writerow([stock,price])
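
The `fieldnames` option mentioned above is not exercised by the cells here, so this is a minimal sketch of it. It assumes headerless tab-separated data shaped like the stock prices file, with `io.StringIO` standing in for the file, and the key names are our own choice:

```python
import csv
import io

# headerless tab-separated data; io.StringIO stands in for an open file
headerless = io.StringIO("6/20/2014\tAAPL\t90.91\n6/20/2014\tMSFT\t41.68\n")

# we supply the keys ourselves because the file has no header row
reader = csv.DictReader(headerless,
                        fieldnames=["date", "symbol", "closing_price"],
                        delimiter="\t")

rows = [dict(row) for row in reader]
prices = {row["symbol"]: float(row["closing_price"]) for row in rows}
print(prices)
```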

`csv.writer` does the right thing if fields themselves have commas in them. Your own hand-rolled writer probably won’t. For example, if you attempt:

In [11]:
results = [["test1", "success", "Monday"],
           ["test2", "success, kind of", "Tuesday"],
           ["test3", "failure, kind of", "Wednesday"],
           ["test4", "failure, utter", "Thursday"]]

# BAD - DON'T DO
with open("bad_csv.txt", "w") as f:  # text mode: in Python 3, "wb" would reject str
    for row in results:
        f.write(",".join(map(str, row))) # might have too many commas in it
        f.write("\n") # row might already have newlines

You will end up with a csv file no one will ever be able to make sense of that looks like:

* test1,success,Monday
* test2,success, kind of,Tuesday
* test3,failure, kind of,Wednesday
* test4,failure, utter,Thursday
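
By contrast, here is a sketch of what `csv.writer` does with the same rows. It writes to an in-memory `io.StringIO` buffer rather than a real file, then reads the text back to check that nothing was lost:

```python
import csv
import io

results = [["test1", "success", "Monday"],
           ["test2", "success, kind of", "Tuesday"],
           ["test3", "failure, kind of", "Wednesday"],
           ["test4", "failure, utter", "Thursday"]]

buffer = io.StringIO()            # in-memory stand-in for a file
writer = csv.writer(buffer)
writer.writerows(results)
written = buffer.getvalue()

# fields containing commas were wrapped in quotes by the writer
print(written)

# reading the text back recovers the original three-field rows
round_tripped = list(csv.reader(io.StringIO(written)))
```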


### Scraping the Web

Fetching web pages = easy, getting meaningful structured info out of them = less so.

#### HTML and the Parsing Thereof

Pages on the Web = written in HTML, in which text is (ideally) marked up into **elements** + **their attributes:**

In [12]:
'''<html>
<head>
<title>A web page</title>
</head>
<body>
<p id="author">Joel Grus</p>
<p id="subject">Data Science</p>
</body>
</html>'''

'<html>\n<head>\n<title>A web page</title>\n</head>\n<body>\n<p id="author">Joel Grus</p>\n<p id="subject">Data Science</p>\n</body>\n</html>'

In a perfect world where all web pages are marked up semantically for our benefit, we'd be able to extract data using rules like “find the `<p>` element whose `id` = "subject"
+ return the text it contains.” In the actual world, HTML is not generally well-formed,
let alone annotated. This means we’ll need help making sense of it.
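
To make the "perfect world" rule concrete, here is a rough sketch using only the standard library's `html.parser` (not the libraries introduced next). The class `TextById` is a hypothetical helper, and it assumes well-formed HTML where the target element contains no nested tags:

```python
from html.parser import HTMLParser

class TextById(HTMLParser):
    """Collect the text of the first element whose id matches target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.capturing = False
        self.text = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if self.text is None and dict(attrs).get("id") == self.target_id:
            self.capturing = True
            self.text = ""

    def handle_data(self, data):
        if self.capturing:
            self.text += data

    def handle_endtag(self, tag):
        # stop at the first closing tag (assumes no nested tags in the target)
        self.capturing = False

html = '''<html>
<head>
<title>A web page</title>
</head>
<body>
<p id="author">Joel Grus</p>
<p id="subject">Data Science</p>
</body>
</html>'''

parser = TextById("subject")
parser.feed(html)
print(parser.text)
```

This works on the tidy sample page but would be fragile on real-world HTML, which is the motivation for a more forgiving toolchain.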

To get data out of HTML, we'll use three libraries:

* **BeautifulSoup** = builds a tree out of the various elements on a web page + provides a simple interface for accessing them
* **requests** = a much nicer way of making HTTP requests than anything built into Python
* **html5lib** = an HTML parser, used because Python’s built-in HTML parser is not that lenient, AKA doesn’t always cope well w/ HTML that’s not perfectly formed

To use Beautiful Soup = pass some HTML into the `BeautifulSoup()` function. In our examples, that HTML will come from the result of a call to `requests.get()`:


In [13]:
from bs4 import BeautifulSoup
import requests

html = requests.get("http://www.google.com").text
soup = BeautifulSoup(html,"html5lib")

After this, we can get pretty far using a few simple methods. We’ll typically work with **Tag objects** = correspond to the tags representing the structure of an HTML page.


In [14]:
# find 1st paragraph + its contents
first_paragraph = soup.find('p')

# second way
first_paragraph2 = soup.p

print(first_paragraph)
print(first_paragraph2)

# get text contents of a Tag via its `text` property
first_paragraph_txt = soup.p.text
# split the text into separate words
first_paragraph_words = soup.p.text.split()
print("\n",first_paragraph_txt,first_paragraph_words)

# extract tag's attributes via treating it like a `dict`
first_p_id = soup.p['id']      # raises KeyError if no 'id'

<p style="color:#767676;font-size:8pt">© 2018 - <a href="/intl/en/policies/privacy/">Privacy</a> - <a href="/intl/en/policies/terms/">Terms</a></p>
<p style="color:#767676;font-size:8pt">© 2018 - <a href="/intl/en/policies/privacy/">Privacy</a> - <a href="/intl/en/policies/terms/">Terms</a></p>

 © 2018 - Privacy - Terms ['©', '2018', '-', 'Privacy', '-', 'Terms']


KeyError: 'id'

In [15]:
first_p_id2 = soup.p.get('id') # returns None if no 'id'

print("\n",first_p_id2)

# get multiple tags at once
all_paragraphs = soup.find_all('p') # or use soup('p')
paragraphs_with_ids = [p for p in soup('p') if p.get('id')]

print("\n",all_paragraphs,"\n",paragraphs_with_ids)


 None

 [<p style="color:#767676;font-size:8pt">© 2018 - <a href="/intl/en/policies/privacy/">Privacy</a> - <a href="/intl/en/policies/terms/">Terms</a></p>] 
 []


Can find tags with a specific class:

In [16]:
impt_paragraphs = soup('p', {'class': 'important'})
impt_paragraphs2 = soup('p','important')
impt_paragraphs3 = [p for p in soup('p')
                    if 'important' in p.get('class', [])]

Can combine these to implement more elaborate logic

In [17]:
## find every <span> element contained inside a <div> element
#     - warning, will return the same span multiple times
#        if it sits inside multiple divs
#     - be more clever if that's the case
spans_inside_div = [span
                   for div in soup('div')
                   for span in div('span')]

Just this handful of features will allow us to do quite a lot. If you end up needing to do
more-complicated things, check the documentation.

Will need to carefully inspect the source HTML, reason through selection logic, + worry about edge cases to make sure your data is correct. Let’s look at an example: **O’Reilly Books About Data**.

A potential investor in our social network thinks data is just a fad. To prove him wrong, you decide to examine how many data books O’Reilly has published over time. After digging
through its website, you find it has many pages of data books + videos, reachable through 30-items-at-a-time directory pages with URLs like:
http://shop.oreilly.com/category/browse-subjects/data.do?sortby=publicationDate&page=1.

Unless you want to be a jerk (and get your scraper banned), before scraping data from a website, 1st check to see if it has some sort of access policy. Looking at:
http://oreilly.com/terms/, there seems to be nothing prohibiting this project. In order to be good citizens, we should also check for a `robots.txt` file, which tells web crawlers how to behave. The important lines in http://shop.oreilly.com/robots.txt are:

* `Crawl-delay: 30` == should wait 30 seconds between requests
* `Request-rate: 1/30` == should request only one page every 30 seconds

Basically 2 different ways of saying the same thing. (There're other lines that indicate directories not to scrape, but they don’t include our URL, so we’re OK.)
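
The standard library can read these directives for us. A minimal sketch using `urllib.robotparser`, fed a hypothetical `robots.txt` (only the `Crawl-delay` and `Request-rate` lines come from the text above; the `User-agent` and `Disallow` lines and the `example.com` URLs are made up):

```python
from urllib.robotparser import RobotFileParser

# hypothetical robots.txt content, passed in as lines instead of being downloaded
robots_lines = [
    "User-agent: *",
    "Crawl-delay: 30",
    "Request-rate: 1/30",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

delay = parser.crawl_delay("*")      # seconds to wait between requests
rate = parser.request_rate("*")      # named tuple: (requests, seconds)
allowed = parser.can_fetch("*", "http://example.com/category/data")
blocked = parser.can_fetch("*", "http://example.com/private/page")

print(delay, rate.requests, rate.seconds, allowed, blocked)
```

For a live site, `set_url()` followed by `read()` would fetch the real file instead of the hand-written lines.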

* ***NOTE***: There’s always the possibility that O’Reilly will at some point revamp its website + break all logic in this section.

To figure out how to extract the data, let’s download one of those pages and feed it to
Beautiful Soup:

In [18]:
import requests
from bs4 import BeautifulSoup

url = "https://ssearch.oreilly.com/?q=data"
soup = BeautifulSoup(requests.get(url).text,'html5lib')

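If we view the source of the page, each book (or video) seems to be uniquely contained in one element. In the book's original example this was a `<td>` table cell whose class = `thumbtext`; on the search page fetched above it is an `<article>` element whose class = `result product-result`. Therefore, a good first step = find all of those `article` elements: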

In [19]:
articles = soup('article','result product-result')
print(len(articles))

15


Next we’d like to filter out videos. (The would-be investor is only impressed by books.) If we inspect the HTML further, we see that each article contains 1+ `a` elements whose class = `book` (the price labels), and whose text looks like `Ebook:`, `Video:`, or `Print:`. It appears the videos contain only one such price label, whose text starts with `Video` (after removing leading spaces). This means we can test for videos with:

In [20]:
def is_video(article):
    """Is video if only 1 pricelabel + if stripped text inside 
    pricelabel starts w/ 'Video'"""
    pricelabel = article('a','book')
    return (len(pricelabel)==1 and
           pricelabel[0].text.strip().startswith("Video"))

print(len([article for article in articles if not is_video(article)]))

15


Now we’re ready to start pulling data out of the `article` elements. It looks like the book title is the text inside an `<a>` tag within a `<p class="title">` tag inside the `<div class="book_text">`:

In [21]:
titles = list(article.find("div","book_text").a.text.strip() for article in articles)
print(titles)

['Interactive Data Visualization for the Web', 'Data Preparation in the Big Data Era', 'Data Driven: Creating a Data Culture', 'Selenium Framework Design in Data-Driven Testing', 'The Big Data Market', 'Oil, Gas, and Data', 'Going Pro in Data Science', 'Data and Social Good', 'Understanding the Chief Data Officer', 'Mapping Big Data', 'The Security Data Lake', 'Not All Data Is Created Equal', '2015 Data Science Salary Survey', 'Managing the Data Lake', 'Python for Data Developers']


The author(s) are in the text of the "note" `<p>`, + are prefaced by a `By` + separated by commas
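
A sketch of that parsing rule on plain strings, before touching the page. The helper name `parse_authors` is an assumption, and the sample by-lines are a few of the ones this page returns:

```python
import re

def parse_authors(by_line):
    """Strip a leading 'By' and split the remainder on commas."""
    without_by = re.sub(r"^By\s+", "", by_line.strip())
    return [name.strip() for name in without_by.split(",")]

by_lines = ["By Scott Murray",
            "By Hilary Mason, DJ Patil",
            "By O'Reilly Media, Inc.",
            "Publisher: O'Reilly Media"]

authors = [parse_authors(by_line) for by_line in by_lines]
print(authors)
```

The last two entries show the kind of edge case to watch for: "Inc." becomes its own "author", and a line without `By` passes through unchanged.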

In [22]:
import re # RegEx
print(list(article.find("div","book_text").find("p","note").text
           for article in articles))

author_names = [article.find("p","note").text
                for article in articles]
(author_names)
# remove leading 'By' and split on commas
#authors = [x.strip() for x in re.sub("^By ", "", author_names).split(",")]

['By Scott Murray', 'By Federico Castanedo', 'By Hilary Mason, DJ Patil', 'By Carl Cocchiaro', 'By Aman Naimat', 'By Daniel Cowles', 'By Jerry Overton', 'By Mike Barlow', 'By Julie Steele', 'By Russell Jurney', 'By Raffael Marty', 'By Mike Barlow, Gregory Fell', 'By John King, Roger Magoulas', "Publisher: O'Reilly Media", "By O'Reilly Media, Inc."]


['By Scott Murray',
 'By Federico Castanedo',
 'By Hilary Mason, DJ Patil',
 'By Carl Cocchiaro',
 'By Aman Naimat',
 'By Daniel Cowles',
 'By Jerry Overton',
 'By Mike Barlow',
 'By Julie Steele',
 'By Russell Jurney',
 'By Raffael Marty',
 'By Mike Barlow, Gregory Fell',
 'By John King, Roger Magoulas',
 "Publisher: O'Reilly Media",
 "By O'Reilly Media, Inc."]

In [67]:
[re.sub(r'^By\s*', '', x).strip() for x in author_names]  # strip only a leading 'By'

['Scott Murray',
 'Federico Castanedo',
 'Hilary Mason, DJ Patil',
 'Carl Cocchiaro',
 'Aman Naimat',
 'Daniel Cowles',
 'Jerry Overton',
 'Mike Barlow',
 'Julie Steele',
 'Russell Jurney',
 'Raffael Marty',
 'Mike Barlow, Gregory Fell',
 'John King, Roger Magoulas',
 "Publisher: O'Reilly Media",
 "O'Reilly Media, Inc."]

ISBN seems to be contained in the link that’s in the thumbheader `<div>`

In [68]:
isbn_link = td.find("div", "thumbheader").a.get("href")
# re.match captures the part of the regex in parentheses
isbn = re.match(r"/product/(.*)\.do", isbn_link).group(1)  # raw string keeps \. intact

NameError: name 'td' is not defined
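
Since `td` is not defined in this notebook, the regex itself can still be checked on a plain string. A small sketch, assuming links of the form `/product/<isbn>.do` (the link values below are made up, and `isbn_from_link` is a hypothetical helper):

```python
import re

def isbn_from_link(href):
    """Return the part of the link between '/product/' and '.do', or None."""
    match = re.match(r"/product/(.*)\.do", href)
    return match.group(1) if match else None

isbn = isbn_from_link("/product/1234567890123.do")  # made-up link
missing = isbn_from_link("/about/contact.html")     # no match, so None

print(isbn, missing)
```

Returning `None` on a failed match avoids the `AttributeError` that calling `.group(1)` directly on a non-matching link would raise.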

And the date is just the contents of the `<span class="directorydate">`:

In [69]:
date = td.find("span", "directorydate").text.strip()

NameError: name 'td' is not defined

Let’s put this all together into a function:

In [71]:
def book_info(td):
    """given a BeautifulSoup <td> Tag representing a book,
    extract the book's details and return a dict"""
    
    title = td.find("div", "thumbheader").a.text
    by_author = td.find('div', 'AuthorName').text
    authors = [x.strip() for x in re.sub("^By ", "", by_author).split(",")]
    isbn_link = td.find("div", "thumbheader").a.get("href")
    isbn = re.match(r"/product/(.*)\.do", isbn_link).groups()[0]
    date = td.find("span", "directorydate").text.strip()
    
    return {
        "title" : title,
        "authors" : authors,
        "isbn" : isbn,
        "date" : date
    }

In [None]:
### SCRAPE
from bs4 import BeautifulSoup
import requests
from time import sleep

base_url = "http://shop.oreilly.com/category/browse-subjects/" + \
    "data.do?sortby=publicationDate&page="
books = []
NUM_PAGES = 31 # at the time of writing, probably more by now

for page_num in range(1, NUM_PAGES + 1):
    print("souping page", page_num, ",", len(books))  # len(books) = books found so far
    url = base_url + str(page_num)
    soup = BeautifulSoup(requests.get(url).text, 'html5lib')

    for td in soup('td', 'thumbtext'):
        if not is_video(td):
            books.append(book_info(td))
    # now be a good citizen and respect the robots.txt!
    sleep(30)

souping page 1 , 0
souping page 2 , 0
souping page 3 , 0
souping page 4 , 0
souping page 5 , 0
souping page 6 , 0
souping page 7 , 0
souping page 8 , 0
souping page 9 , 0
souping page 10 , 0
souping page 11 , 0
souping page 12 , 0
souping page 13 , 0
souping page 14 , 0
souping page 15 , 0
souping page 16 , 0
souping page 17 , 0
souping page 18 , 0
souping page 19 , 0
souping page 20 , 0
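
The loop above mixes fetching, parsing, and waiting, which makes it hard to try out without hitting the site. One possible restructuring, as a sketch only: `scrape_pages` and its parameters are hypothetical names, and the fake pages stand in for real HTTP responses.

```python
import time

def scrape_pages(page_numbers, fetch, parse, delay_seconds=30, sleep=time.sleep):
    """Fetch pages one at a time, pausing between consecutive requests."""
    items = []
    for i, page_num in enumerate(page_numbers):
        if i > 0:
            sleep(delay_seconds)       # respect the crawl delay between requests
        items.extend(parse(fetch(page_num)))
    return items

# fake pages and a fake sleep, so the sketch runs instantly and offline
fake_pages = {1: "a,b", 2: "c"}
pauses = []

found = scrape_pages([1, 2],
                     fetch=fake_pages.get,
                     parse=lambda text: text.split(","),
                     delay_seconds=30,
                     sleep=pauses.append)
print(found, pauses)
```

In real use, `fetch` would wrap `requests.get(url).text` and `parse` would wrap the BeautifulSoup extraction, with the default `time.sleep` left in place.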


***NOTE***: Extracting data from HTML = more data art than data science.

Now that we’ve collected the data, we can plot the number of books published each year

In [None]:
def get_year(book):
    """book["date"] looks like 'November 2014' so we need to
    split on the space and then take the second piece"""
    return int(book["date"].split()[1])

# 2014 is the last complete year of data (when I ran this)
year_counts = Counter(get_year(book) for book in books
                      if get_year(book) <= 2014)

import matplotlib.pyplot as plt

years = sorted(year_counts)
book_counts = [year_counts[year] for year in years]
plt.plot(years, book_counts)
plt.ylabel("# of data books")
plt.title("Data")
plt.show()

Unfortunately, the would-be investor looks at the graph and decides that 2013 was “peak
data.”

### Using APIs
Many websites + web services provide application programming interfaces (APIs), which allow you to explicitly request data in a structured format. This saves you the trouble of having to scrape them.

### JSON (and XML)

B/c HTTP = a protocol for transferring text, the data you request through a web API needs to be **serialized** into a string format. Often this serialization uses **JavaScript Object Notation (JSON)**. JavaScript objects look quite similar to Python `dict`s, which makes their string representations easy to interpret:

In [None]:
{ "title" : "Data Science Book",
"author" : "Joel Grus",
"publicationYear" : 2014,
"topics" : [ "data", "science", "data science"] }

Can parse JSON using `json` module, in particular, `loads()` = deserializes a string representing a JSON object into a Python object:

In [None]:
import json
serialized = """{ "title" : "Data Science Book",
                "author" : "Joel Grus",
                "publicationYear" : 2014,
                "topics" : [ "data", "science", "data science"] }"""

# parse JSON to create Python dict
