## Chapter 9. Getting Data

### stdin, stdout
 
If you run Python scripts at the command line, you can **pipe** (`|`) data through them using `sys.stdin` and `sys.stdout`.
Ex: One script reads in lines of text + writes back out the ones that match a RegEx; a second script counts the lines it receives + writes out that count

In [1]:
# egrep.py
import sys, re

# sys.argv = list of CLI args
# sys.argv[0] = name of the program itself
# sys.argv[1] = RegEx specified at the CL
regex = sys.argv[1]

# for every line passed into the script
for line in sys.stdin:
    # if matches RegEx, write to stdout
    if re.search(regex,line):
        sys.stdout.write(line)
        
# line_count.py
import sys

count = 0
for line in sys.stdin:
    count += 1
# print goes to stdout
print(count)

0


Can use these to count how many lines in a file contain a number (i.e., at least one digit)

In [None]:
# windows
#!type someFile.txt | python egrep.py "[0-9]" | python line_count.py

# linux
#!cat someFile.txt | python egrep.py "[0-9]" | python line_count.py

In [4]:
# script that counts words in its input + writes out most common ones:
# most_common_words.py
import sys
from collections import Counter

# pass in number of words as 1st arg
try:
    num_words = int(sys.argv[1])
except (IndexError, ValueError): # missing arg or arg that isn't an integer
    print("usage: most_common_words.py num_words")
    sys.exit(1) # non-zero exit code = indicates error

counter = Counter(word.lower()
                 for line in sys.stdin
                 for word in line.strip().split() # split on space 
                 if word) # skip empty words

for word,count in counter.most_common(num_words):
    sys.stdout.write(str(count))
    sys.stdout.write("\t")
    sys.stdout.write(word)
    sys.stdout.write("\n")

### Ex of above script
## type the_bible.txt | python most_common_words.py 10
# 64193 the
# 51380 and
# 34753 of
# 13643 to
# 12799 that
# 12560 in
# 10263 he
# 9840 shall
# 8987 unto
# 8836 for

usage: most_common_words.py num_words




### Reading Files

Can also explicitly read from + write to files directly in Python code.

### Basics of Text Files

1st step to working with a text file = **obtain a *file object* via `open`**

In [None]:
# 'r' = read only
file_for_reading = open("reading_file.txt", "r")

# 'w' = write = ***DESTROYS the file if it already exists***
file_for_writing = open("writing_file.txt", "w")

# 'a' = append (to end of file)
file_for_appending = open("appending_file.txt", "a")

# close files when done
file_for_writing.close()

It's easy to forget to close files, so always use them in a **`with` block**, at the end of which they will be closed automatically:
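A minimal runnable sketch of that behaviour, assuming a throwaway temporary file (the path and the variable names are made up for illustration):

```python
import os
import tempfile

# Hypothetical scratch file, created only for this demonstration
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, "w") as f:
    f.write("hello\n")
    open_inside = not f.closed  # the file is still open inside the block

closed_after = f.closed  # the file was closed automatically on leaving the block

with open(path, "r") as f:
    data = f.read()

print(open_inside, closed_after, repr(data))
```

The file object still exists after the block, but any attempt to read from or write to it raises a `ValueError`.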

In [None]:
with open(filename,'r') as f:
    data=function_to_get_data_from(f)
    
# at this point, file has been closed, don't try to use it
process(data)

To read over whole text file, iterate over lines with `for`

In [None]:
import re

starts_with_hash = 0

with open("input.txt", "r") as f:
    for line in f: # iterate over the file object itself, 1 line at a time
        if re.match("^#",line): # check if each line starts with # and count if True
            starts_with_hash += 1

Every line you get this way ends in a **newline** character, so you’ll often want to `strip()` it before doing anything with it.
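A small sketch of the difference, assuming an in-memory `io.StringIO` as a stand-in for a real file object (the sample text is made up):

```python
import io

# io.StringIO stands in for a file opened with open(...)
fake_file = io.StringIO("first line\nsecond line\n")

raw = [line for line in fake_file]              # newlines kept
fake_file.seek(0)                               # rewind to read again
cleaned = [line.strip() for line in fake_file]  # newlines removed

print(raw)
print(cleaned)
```

Note that `strip()` also removes leading + trailing spaces; `rstrip("\n")` removes only the trailing newline.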

Ex: You have a file full of email addresses, 1 per line, + you need to generate a histogram of the domains. The rules for correctly extracting domains are somewhat subtle (e.g., the Public Suffix List), but a good 1st approximation = just take the part of each email address that comes after the @ (which gives the wrong answer for email addresses like "@mail.datasciencester.com").

In [6]:
from collections import Counter

def get_domain(email_address):
    """Split on '@' and return the last piece"""
    return email_address.lower().split("@")[-1]

with open("email_address.txt", "r") as f:
    domain_counts = Counter(get_domain(line.strip())
                            for line in f
                            if "@" in line)


### Delimited Files

More often we work w/ files that have lots of data on each line, very often comma- or tab-separated. Each line has several fields, w/ a comma/tab indicating where 1 field ends + the next starts.

This gets complicated when you have fields w/ commas + tabs + newlines in them. For this reason, it’s pretty much always a mistake to try to parse them yourself. Instead, use Python’s `csv` module (or `pandas` library).
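A quick sketch of why hand-parsing goes wrong, assuming one made-up record whose middle field contains a comma (read from an in-memory `io.StringIO` instead of a real file):

```python
import csv
import io

# Made-up record: the 2nd field contains a comma, so it is quoted
raw_line = 'test2,"success, kind of",Tuesday\n'

# Naive approach: split on every comma, including the one inside the quotes
naive = raw_line.strip().split(",")

# csv.reader understands the quoting and keeps the field intact
parsed = next(csv.reader(io.StringIO(raw_line)))

print(len(naive), naive)
print(len(parsed), parsed)
```

The naive split yields 4 pieces with stray quote characters, while `csv.reader` returns the 3 intended fields.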

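For technical reasons, in Python 3 always open CSV files in text mode + pass `newline=""` to `open`, so the `csv` module can handle line endings itself. (The older advice to use **binary mode** by including a `b` after the `r` or `w` applies to Python 2 only; see Stack Overflow.)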

If your file has no headers (which means you probably want each row as a list, + which places the burden on you to know what’s in each column), you can use `csv.reader` to iterate over rows, each of which will be an appropriately split list.

Ex: TSV of stock prices:

In [13]:
import csv

def process(date, symbol, price):
    print(date, symbol, price)

with open("tab_delimited_stock_prices.txt", "r") as f:
    reader = csv.reader(f, delimiter="\t")
    
    for row in reader:
        date = row[0]
        symbol = row[1]
        closing_price = float(row[2])
        process(date,symbol,closing_price)

6/20/2014 AAPL 90.91
6/20/2014 MSFT 41.68
6/20/2014 FB 64.5
6/19/2014 AAPL 91.86
6/19/2014 MSFT 41.51
6/19/2014 FB 64.34


If the file has headers, we can skip them (with an initial call to `next(reader)`; `reader.next()` is Python 2 only) or get each row as a `dict` (with headers as keys) by using `csv.DictReader`:
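The 1st option, skipping the header row, might look like this sketch. It assumes made-up colon-delimited sample data held in an `io.StringIO` rather than a real file:

```python
import csv
import io

# Made-up sample data with a header row, standing in for a real file
sample = io.StringIO(
    "date:symbol:closing_price\n"
    "6/20/2014:AAPL:90.91\n"
    "6/20/2014:MSFT:41.68\n"
)

reader = csv.reader(sample, delimiter=":")
header = next(reader)  # consume the header row so the loop sees only data

rows = [(date, symbol, float(price)) for date, symbol, price in reader]

print(header)
print(rows)
```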

In [12]:
import csv

def process(date, symbol, price):
    print(date, symbol, price)

with open("colon_delimited_stock_prices.txt", "r") as f:
    reader = csv.DictReader(f, delimiter=":")
    
    for row in reader:
        date = row["date"]
        symbol = row["symbol"]
        closing_price = float(row["closing_price"])
        process(date,symbol,closing_price)

6/20/2014 AAPL 90.91
6/20/2014 MSFT 41.68
6/20/2014 FB 64.5


Even if the file doesn’t have headers we can still use `DictReader` by passing it the keys as a `fieldnames` parameter, + we can similarly write out delimited data using `csv.writer`:
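A sketch of the `fieldnames` case, assuming made-up headerless tab-separated data in an `io.StringIO` (the key names are chosen here for illustration):

```python
import csv
import io

# Made-up headerless TSV data, standing in for a real file
headerless = io.StringIO(
    "6/20/2014\tAAPL\t90.91\n"
    "6/20/2014\tMSFT\t41.68\n"
)

reader = csv.DictReader(headerless,
                        fieldnames=["date", "symbol", "closing_price"],
                        delimiter="\t")
rows = list(reader)  # every line is data, since we supplied the keys ourselves

print(rows[0]["symbol"], float(rows[0]["closing_price"]))
```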

In [16]:
todays_prices = {"AAPL":90.91,"MSFT":41.68,"FB":64.5}

with open("comma_delimited_stock_prices.txt", "w", newline="") as f: # newline="" lets csv control line endings
    writer = csv.writer(f, delimiter=",")
    
    for stock,price in todays_prices.items():
        writer.writerow([stock,price])

`csv.writer` does the right thing if fields themselves have commas in them. Your own hand-rolled writer probably won’t. For example, if you attempt:

In [None]:
results = [["test1", "success", "Monday"],
           ["test2", "success, kind of", "Tuesday"],
           ["test3", "failure, kind of", "Wednesday"],
           ["test4", "failure, utter", "Thursday"]]

# BAD - DON'T DO
with open("bad_csv.txt","w") as f:
    for row in results:
        f.write(",".join(map(str,row))) # might have too many commas in it
        f.write("\n") # row might already have newlines

You will end up with a CSV file that no one will ever be able to make sense of, which looks like:

* test1,success,Monday
* test2,success, kind of,Tuesday
* test3,failure, kind of,Wednesday
* test4,failure, utter,Thursday
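For comparison, a sketch of the same rows written with `csv.writer`. It writes to an in-memory `io.StringIO` as a stand-in for a real file; with a real file you would use something like `open("good_csv.txt", "w", newline="")`, where the filename is hypothetical:

```python
import csv
import io

results = [["test1", "success", "Monday"],
           ["test2", "success, kind of", "Tuesday"],
           ["test3", "failure, kind of", "Wednesday"],
           ["test4", "failure, utter", "Thursday"]]

buffer = io.StringIO()  # stand-in for a file opened for writing
writer = csv.writer(buffer)
writer.writerows(results)  # fields containing commas get quoted automatically

lines = buffer.getvalue().splitlines()
print(lines[1])

# Reading the data back recovers the original 3 fields per row
buffer.seek(0)
round_trip = list(csv.reader(buffer))
print(round_trip == results)
```

Because the fields with commas are wrapped in quotes, any CSV reader can split each row back into exactly 3 fields.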


### Scraping the Web

Fetching web pages = easy; getting meaningful structured info out of them = less so