# Reading data from files

Sometimes we need to read data from files. In general, these will be text files or binary files. Text files are easy to read; binary files are harder, because you need to know how the bytes are laid out.

Let's start by writing a bit of text to a file, then reading it back.
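
Here is a minimal sketch of writing a little text and reading it back. The file name and its contents are made up for illustration, and the file goes into a temporary directory so that nothing in `../data` is touched.

```python
import tempfile
from pathlib import Path

# Hypothetical file name and contents, for illustration only.
fname = Path(tempfile.mkdtemp()) / 'example.txt'

# Write a couple of lines of text. The 'with' block closes the file for us.
with open(fname, 'w') as f:
    f.write('Hello, file!\n')
    f.write('This is the second line.\n')

# Read the whole file back as a single string.
with open(fname) as f:
    text = f.read()

print(text)
```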

## File modes

You have to decide what you want to do with the file.

- **`r`** &mdash; read only (default)
- **`r+`** &mdash; read and write (pointer at 0 &mdash; careful to manage the pointer!)
- **`w`** &mdash; write new file (clobbers existing files)
- **`a`** &mdash; append to an existing file (creates the file if it does not exist)

You can also add another letter to indicate whether you're handling text or bytes:

- **`t`** &mdash; text (default)
- **`b`** &mdash; bytes

For example, to open an existing text file for appending data to the end:

    with open(fname, 'at') as f:
        f.write('New data')
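
To see how the modes differ, here is a small sketch that runs through `w`, `a`, `r+` and `rb` on a throwaway file. The file name is hypothetical and the file lives in a temporary directory.

```python
import tempfile
from pathlib import Path

# Hypothetical scratch file, for illustration only.
fname = Path(tempfile.mkdtemp()) / 'modes_demo.txt'

# 'w' creates the file, or clobbers it if it already exists.
with open(fname, 'w') as f:
    f.write('first\n')

# 'a' adds to the end of the file.
with open(fname, 'a') as f:
    f.write('second\n')

# 'r+' starts with the pointer at 0, so writing overwrites in place.
with open(fname, 'r+') as f:
    f.write('FIRST')
    f.seek(0)               # Move the pointer back to the start to read.
    after_rplus = f.read()

# 'w' again: the previous contents are gone.
with open(fname, 'w') as f:
    f.write('third')

# Adding 'b' gives us bytes instead of a string.
with open(fname, 'rb') as f:
    raw = f.read()

print(repr(after_rplus))
print(raw)
```

Note how `r+` replaced the first five characters rather than inserting new ones, which is why the pointer needs care.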

## Read some data

Let's read some tops data. Looping over the file object gives us one line at a time:

In [None]:
with open('../data/L-30_tops.txt', 'r') as f:
    for line in f:
        print(line, end='')

<div class="alert alert-success">
<h3>Exercise</h3>

Write a `for` loop to read the lines of the file one by one, adding key: value pairs to a dictionary as you go.

<a title="You will need to skip lines that look like comments. Use `str.split()` to break the line at a comma, and `float()` to convert strings to numbers.">**Hover for hints**</a>
</div>

In [None]:
# YOUR CODE HERE




<div class="alert alert-success">
<h3>Exercise</h3>

Your challenge is to turn this into a function, complete with docstring and any options you want to try to implement. For example:

- Try putting everything, including the file reading, into the function. (Better yet, write functions for each main 'chunk' of the workflow.)
- When the function works on `../data/L-30_tops.txt`, make it work on `../data/B-41_tops.txt`. (Remember you have some 'depth, unit' code from Day 1.)
- You could let the user choose different 'comment' characters.
- Let the user select different delimiters, other than a comma.
- Other things, like skipping lines or transforming the case of the names, could also be optional.
- Maybe print some 'progress logging' as you go, so the user knows what's going on.
- Don't forget the docstring!

When you're done, add the function to `utils.py`.
</div>

In [None]:
def clean_depth(string):
    """Clean the units from a number."""
    if 'm' in string.lower():
        units = 'M'
    elif 'f' in string.lower():
        units = 'FT'
    else:
        units = None

    stripped = string.lower().strip(' .mft\n\t')
    value = stripped.replace(',', '')

    return float(value), units
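
One possible shape for the finished function is sketched below. Only the name `get_tops_from_file` and its `comment` argument come from this notebook; the other arguments (`delimiter`, `skip`, `verbose`), the title-casing of names, and the demo file are assumptions made for illustration. `clean_depth()` is repeated in compact form so that the sketch runs on its own, and the units it returns are simply discarded.

```python
import tempfile
from pathlib import Path


def clean_depth(string):
    """Clean the units from a number (same logic as the function above)."""
    lowered = string.lower()
    units = 'M' if 'm' in lowered else 'FT' if 'f' in lowered else None
    value = lowered.strip(' .mft\n\t').replace(',', '')
    return float(value), units


def get_tops_from_file(fname, comment='#', delimiter=',', skip=0, verbose=False):
    """
    Read formation tops from a text file into a dictionary.

    Args:
        fname (str): Path to a file with one 'name<delimiter>depth' per line.
        comment (str): Lines starting with this string are ignored.
        delimiter (str): The string separating the name from the depth.
        skip (int): Number of lines to skip at the top of the file.
        verbose (bool): Print each top as it is read.

    Returns:
        dict: Formation names (title case) mapped to depths (floats).
    """
    tops = {}
    with open(fname) as f:
        for i, line in enumerate(f):
            # Ignore skipped lines, blank lines and comments.
            if i < skip or not line.strip() or line.lstrip().startswith(comment):
                continue
            # Split on the first delimiter only, so '1,200 m' stays intact.
            name, depth = line.split(delimiter, maxsplit=1)
            value, _ = clean_depth(depth)
            tops[name.strip().title()] = value
            if verbose:
                print(f'Read {name.strip()}: {value}')
    return tops


# Made-up data in a temporary file, for illustration only.
demo = Path(tempfile.mkdtemp()) / 'demo_tops.txt'
demo.write_text('% Made-up tops, for illustration only\n'
                'ALPHA SAND, 1,200.5 m\n'
                'beta shale, 1500 m\n')

tops = get_tops_from_file(demo, comment='%')
print(tops)
```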

In [None]:
get_tops_from_file('../data/B-41_tops.txt', comment='%')

## Reading multiple files

Sometimes we want to crawl directories. Usually, you can accomplish this with `glob`:

In [None]:
import glob

glob.glob('../data/*.[Ll][Aa][Ss]')

In [None]:
pd.read_csv??

The result of `glob.glob()` is just a list of path strings, so you can loop over it to read multiple files. It supports the usual syntax for UNIX-style globbing, including recursion over directories with `**` (which needs `recursive=True`).
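
As a sketch of looping over the results, the example below builds a small throwaway directory tree and globs it with and without recursion. All the directory and file names are made up.

```python
import glob
import os
import tempfile

# Build a small throwaway directory tree. All names here are made up.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'well_A', 'logs'))
for rel in ['top.las', 'well_A/a.LAS', 'well_A/logs/b.las', 'well_A/notes.txt']:
    with open(os.path.join(root, rel), 'w') as f:
        f.write('placeholder\n')

# Without recursion, only the top-level directory is searched.
top_level = glob.glob(os.path.join(root, '*.[Ll][Aa][Ss]'))

# With '**' and recursive=True, subdirectories are searched too.
pattern = os.path.join(root, '**', '*.[Ll][Aa][Ss]')
all_las = sorted(glob.glob(pattern, recursive=True))

# The result is a plain list of strings, so we can loop over it.
for path in all_las:
    with open(path) as f:
        print(os.path.basename(path), len(f.read()), 'characters')
```

Without `recursive=True`, `**` behaves like an ordinary `*` and does not descend through multiple directory levels.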

## Intro to Python students: stop here for now

----

## Reading without loops: `re`

In [None]:
# With regex.
import re

with open('../data/L-30_tops.txt') as f:
    data = re.findall(r'(.+?),([.0-9]+)', f.read())
    tops = {name: float(depth) for name, depth in data}

In [None]:
tops
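
To see what the pattern does, here it is applied to a short made-up string instead of the real file. Each match is a tuple of the two groups: everything before a comma, and the run of digits and dots after it. A line with no number straight after the comma, such as a header, produces no match.

```python
import re

# Made-up text standing in for the contents of a tops file.
text = '# Formation,Depth\nAlpha,100.5\nBeta Shale,250\n'

# Group 1: as few characters as possible before a comma.
# Group 2: one or more digits or dots straight after the comma.
data = re.findall(r'(.+?),([.0-9]+)', text)
print(data)

tops = {name: float(depth) for name, depth in data}
print(tops)
```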

## Reading without loops: `map` and `filter`

In [None]:
def process_line(line):
    """Process one valid line into a (name, depth) pair."""
    k, v = line.split(',')
    return k, float(v)

def is_valid(line):
    """Decide if a line is processable, i.e. is not a comment."""
    return '#' not in line

with open('../data/L-30_tops.txt') as f:
    tops = dict(map(process_line, filter(is_valid, f)))
    
tops

## Read using NumPy

We can use `np.loadtxt()` for numeric files.

In [None]:
import numpy as np
np.loadtxt('../data/L-30_tops.txt', skiprows=1, usecols=[1], delimiter=',')

Or there's [`np.genfromtxt()`](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.io.genfromtxt.html), which copes better with missing values &mdash; try running it on `'../data/B-41_tops.txt'`.

In [None]:
np.genfromtxt('../data/L-30_tops.txt', skip_header=1, delimiter=',')

Both functions have a useful keyword argument, `unpack`, which you should set to `True` to get the columns back as separate vectors.

Note that both functions can also read gzip-compressed files directly; they recognize them by the `.gz` file extension.
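
The sketch below tries both features at once with `np.loadtxt()`: it writes a small gzipped CSV to a temporary directory, then reads the columns back as separate arrays with `unpack=True`. The file name, column names and values are made up for illustration.

```python
import gzip
import tempfile
from pathlib import Path

import numpy as np

# Made-up numeric data with a one-line header.
rows = 'depth,gr\n100.0,45.0\n100.5,50.0\n101.0,62.5\n'

# Write it as a gzip-compressed text file; the name must end in '.gz'.
fname = Path(tempfile.mkdtemp()) / 'log.csv.gz'
with gzip.open(fname, 'wt') as f:
    f.write(rows)

# unpack=True transposes the result, so each column can be
# assigned to its own variable.
depth, gr = np.loadtxt(str(fname), delimiter=',', skiprows=1, unpack=True)

print(depth)
print(gr)
```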

## `csv` built-in module

In [None]:
import csv

with open('../data/L-30_tops.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

In [None]:
import csv

with open('../data/L-30_tops.csv') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row['Formation name'], row['Depth [m]'])

## Read file using pandas

In [None]:
import pandas as pd

df = pd.read_csv('../data/L-30_tops.csv')

In [None]:
df

In [None]:
import pandas as pd

df = pd.read_csv('../data/L-30_tops.txt', skiprows=1, names=['Formation', 'Depth'])

In [None]:
df

In [None]:
df['Formation'] = df['Formation'].str.title()
df.head()

In [None]:
df.to_csv('../data/L-30_tops_improved.csv')

<div class="alert alert-success">
<h3>Exercises</h3>

- Read the data from `B-41_tops.txt`.
- Write a function that will load data from either of these files.
- Load the data into pandas.
- Write a new CSV file with the cleaned data.
</div>