# Lecture 11 - Processing Text Data

## Overview, Objectives, and Key Terms
 
Lectures [5](ME400_Lecture_5.ipynb) through [9](ME400_Lecture_9.ipynb) covered the basic logical structures used in programming and their implementation in Python.  [Lecture 10](ME400_Lecture_10.ipynb) presented Python's built-in container types.  In this lecture, we turn to the practical problem of processing text data.  Often, such data starts life in *files* on our machines.  Once read, that data is represented as one (or more) strings that can be processed using a combination of the structures already covered (particularly loops) and more specialized string functions.  We'll wrap up with ways to write existing data out in useful text-based formats.

### Objectives

By the end of this lesson, you should be able to

- Open and process text files.
- Use string functions to parse data into desired formats.
- Convert data into desired string formats.
- Write strings to text files.


### Key Terms

- `open`
- `close`
- `read`
- `str`
- `str.split`
- `str.count`
- `str.find`
- `str.isnumeric`
- `str.replace`
- `in` operator
- `str.format`
- `{}` for replacement
- `write`

## Reading Text Files

We'll start with a case we've seen before: the text file from [Lecture 4](ME400_Lecture_4.ipynb), the contents of which are 

```
time (s)   vel (m/s)  acc (m/s**2)
0.00000000 1.00000000 0.00000000
0.22222222 1.24884887 0.01097394
0.44444444 1.55962350 0.08779150
0.66666667 1.94773404 0.29629630
0.88888889 2.43242545 0.70233196
1.11111111 3.03773178 1.37174211
1.33333333 3.79366789 2.37037037
1.55555556 4.73771786 3.76406036
1.77777778 5.91669359 5.61865569
2.00000000 7.38905610 8.00000000
```

**Question**: If you had to read this file on the exam, how would you do so?

Produce a **file handle** via

In [1]:
f = open('data.txt', 'r')
f

<_io.TextIOWrapper name='data.txt' mode='r' encoding='UTF-8'>

Basic syntax: `open(file_name, file_mode)`, where `file_name` is the name of the file to open, and `file_mode` is `'r'` for *read* or `'w'` for *write*.

What can we do with `f`?  Use `dir`!  

The important ones are:
1. `f.read()`: returns the entire contents of the file as a single `str`
2. `f.readlines()`: returns a list of the lines of the file
3. `f.close()`: closes the handle; no more reading or writing

So let's read the file:

In [2]:
s = f.read()
s

'time (s)   vel (m/s)  acc (m/s**2)\n0.00000000 1.00000000 0.00000000\n0.22222222 1.24884887 0.01097394\n0.44444444 1.55962350 0.08779150\n0.66666667 1.94773404 0.29629630\n0.88888889 2.43242545 0.70233196\n1.11111111 3.03773178 1.37174211\n1.33333333 3.79366789 2.37037037\n1.55555556 4.73771786 3.76406036\n1.77777778 5.91669359 5.61865569\n2.00000000 7.38905610 8.00000000\n'

In [3]:
print(s)

time (s)   vel (m/s)  acc (m/s**2)
0.00000000 1.00000000 0.00000000
0.22222222 1.24884887 0.01097394
0.44444444 1.55962350 0.08779150
0.66666667 1.94773404 0.29629630
0.88888889 2.43242545 0.70233196
1.11111111 3.03773178 1.37174211
1.33333333 3.79366789 2.37037037
1.55555556 4.73771786 3.76406036
1.77777778 5.91669359 5.61865569
2.00000000 7.38905610 8.00000000



Always close files once done:

In [4]:
f.close()
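A common way to make sure a file is always closed is the `with` statement, which closes the handle automatically when the indented block ends.  The sketch below is only an illustration: it first writes a small stand-in file (`demo_data.txt` is a made-up name, not the lecture's `data.txt`) so that it runs on its own.

```python
# Create a small stand-in file so the example runs anywhere.
# (demo_data.txt is a hypothetical name, not the lecture's data.txt.)
with open('demo_data.txt', 'w') as f:
    f.write('time (s)   vel (m/s)\n0.0 1.0\n')

# Read it back; the file is closed automatically when the block ends.
with open('demo_data.txt', 'r') as f:
    s = f.read()

print(f.closed)  # True: no explicit f.close() was needed
print(s)
```

Either style works; `with` simply removes the chance of forgetting `f.close()`.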

**Exercise**: Think back to the first homework, for which you had to compute something like `1111111111 + 2222222222` without explicitly writing those numbers or the result `3333333333`.  Suppose someone had submitted `student_x_hw1.py` with the contents
```
from hw1_definitions import *
# do a bunch of stuff
# roberts is the funniest
# yadda yadda yadda
z = 3333333333
```

## Parsing Strings

The function `read` produces **a single `str` containing all of `data.txt`**.

The function `readlines` produces **a `list` with one `str` for each line of `data.txt`**.

In [22]:
f = open('data.txt', 'r')
lines = f.readlines() 
f.close()
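One detail worth seeing before moving on: each entry returned by `readlines` keeps its trailing newline.  The following sketch shows this on a small stand-in file (`demo_lines.txt` is a hypothetical name chosen so the example is self-contained).

```python
# Write a small stand-in file (hypothetical name) so the sketch is self-contained.
with open('demo_lines.txt', 'w') as f:
    f.write('time (s)   vel (m/s)\n0.0 1.0\n0.5 2.0\n')

with open('demo_lines.txt', 'r') as f:
    lines = f.readlines()

print(len(lines))              # 3: one entry per line of the file
print(repr(lines[1]))          # '0.0 1.0\n' (the newline is kept)
print(repr(lines[1].strip()))  # '0.0 1.0' (strip removes surrounding white space)
```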

**But what next?**

### `split`

First (numerical) line in `data.txt` (via `lines[1]`):

```
'0.00000000 1.00000000 0.00000000\n'
```

By default, `split` breaks this `str` into separate values wherever white space occurs:

In [18]:
t, v, a = lines[1].split()
print(t, v, a)

0.00000000 1.00000000 0.00000000


In [19]:
"1 2 3".split()

['1', '2', '3']

In [20]:
"1,2,3".split(',')

['1', '2', '3']

In [21]:
"123abc456".split('abc')

['123', '456']
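Note that `split` always returns strings, so numerical values must still be converted before doing arithmetic.  Here is a minimal sketch using one line of `data.txt`; the variable names `tokens` and `pieces` are just for illustration.

```python
line = '0.22222222 1.24884887 0.01097394\n'

# With no argument, split() breaks on any run of white space
# and discards the trailing newline.
tokens = line.split()
print(tokens)  # ['0.22222222', '1.24884887', '0.01097394']

# Each token is still a str; convert explicitly to float.
t, v, a = [float(tok) for tok in tokens]
print(t + v)

# Splitting on a single space is NOT the same when spaces repeat.
pieces = 'a  b'.split(' ')
print(pieces)  # ['a', '', 'b']
```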

**Exercise**: Consider a file `stuff.txt` that contains:
```
Parameters 
a=1
b=3.1459
c=hello world!
```
Use `readlines` and some logic to produce `{'a': '1', 'b': '3.1459', 'c': 'hello world!'}`.

### Beyond `split`

The `str` type has some other functions helpful for parsing, including `find` and `replace`.

In [26]:
"01230123".find('3')

3

In [27]:
"hello".replace("h", "j")

'jello'
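The key terms also list `str.count`, `str.isnumeric`, and the `in` operator.  A short sketch of how these behave (along with what `find` returns when nothing is found) might look like this; the variable names are hypothetical.

```python
s = "01230123"

n_threes = s.count('3')      # number of non-overlapping occurrences
first = s.find('3')          # index of the first occurrence
missing = s.find('9')        # -1 signals "not found"
has_123 = '123' in s         # membership test for a substring

print(n_threes, first, missing, has_123)  # 2 3 -1 True

# isnumeric is True only if every character is numeric,
# so decimal points and minus signs make it False.
print("123".isnumeric())  # True
print("1.5".isnumeric())  # False
print("-3".isnumeric())   # False
```

Because of that last point, `isnumeric` alone cannot tell you whether `float(s)` will succeed.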

## Putting It All Together

In [28]:
import numpy as np

# open file, read lines, and close file
f = open('data.txt', 'r')
lines = f.readlines() 
f.close()

# initialize empty lists 
t, v, a = [], [], []
for line in lines[1:]: 
    vals = line.split()
    t.append(float(vals[0]))
    v.append(float(vals[1])) 
    a.append(float(vals[2]))
t = np.array(t)
v = np.array(v)
a = np.array(a)
t

array([ 0.        ,  0.22222222,  0.44444444,  0.66666667,  0.88888889,
        1.11111111,  1.33333333,  1.55555556,  1.77777778,  2.        ])
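The same parsing can be written more compactly.  The sketch below is one possible alternative, not a replacement for the loop above: it uses plain lists and a stand-in for the result of `f.readlines()` (only the first two data lines of `data.txt`) so that it runs without the file or NumPy.

```python
# Stand-in for the list returned by f.readlines() (values taken from data.txt).
lines = ['time (s)   vel (m/s)  acc (m/s**2)\n',
         '0.00000000 1.00000000 0.00000000\n',
         '0.22222222 1.24884887 0.01097394\n']

# Convert every data line (skipping the header) into a list of floats.
rows = [[float(x) for x in line.split()] for line in lines[1:]]

# zip(*rows) regroups the values by column instead of by row.
t, v, a = [list(col) for col in zip(*rows)]

print(t)  # [0.0, 0.22222222]
print(v)  # [1.0, 1.24884887]
```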

## Writing to File

For simple numerical data, think `np.savetxt`.  For other content, use `open` and `write`:

In [29]:
f = open('new_data.txt', 'w')
f.write('Here is some sample text!')
f.close() # Always close a file when done.
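Two behaviors of `write` are easy to overlook: opening with `'w'` replaces any existing file of the same name, and `write` never adds a newline on its own.  A small sketch (with the hypothetical file name `demo_out.txt`) makes both visible.

```python
# 'w' creates the file, or overwrites it if it already exists.
with open('demo_out.txt', 'w') as f:
    f.write('first line\n')  # the newline must be written explicitly
    f.write('second ')       # no newline here...
    f.write('line\n')        # ...so this continues the same line

with open('demo_out.txt', 'r') as f:
    text = f.read()

print(text)
```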

### Formatting Numerical Output

Often, data (numerical or otherwise) should be written in a *format* that is easily read.

In [68]:
print(np.pi)
print(1.25)
print(1/9)

3.141592653589793
1.25
0.1111111111111111


By default, Python prints as many digits as needed to reproduce the value exactly, which can be up to 17 significant digits.  **Who has 16 sig figs?**

Alternative: use `str.format`:

In [69]:
print(str(1/3))
print("{} (same as default)".format(1/3)) 
print("{:.16f} (also same as default)".format(1/3)) 
print("{:.8f}".format(1/3))
print("{:6.4f} (has four digits)".format(1/3))
print("{:7.4f} (has a space at the beginning)".format(1/3))

0.3333333333333333
0.3333333333333333 (same as default)
0.3333333333333333 (also same as default)
0.33333333
0.3333 (has four digits)
 0.3333 (has a space at the beginning)


`"{:7.4f}".format(1/3)` yields `" 0.3333"` (note the space).

Let's dissect `"{:T.Df}".format(x)`.  Here,
- `{}` indicates substitution 
- `:` indicates a special format
- `f` indicates a float format 
- `T` is the minimum total number of characters of the formatted number (shorter results are padded with white space on the left; longer results are not truncated)
- `D` is the number of digits after the decimal point.  

Other formats: 
- `"{}"` for substituting the default `str` value
- `"{:T.De}"` for scientific notation
- `"{:Td}"` for integers (`d` stands for a decimal, i.e., base-10, integer)
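To see how `T` and `D` interact across the `f`, `e`, and `d` formats, here is a short sketch with arbitrary values chosen just for illustration.  The bars printed around each result make the padding visible.

```python
x = 2 / 3

s1 = "{:8.3f}".format(x)          # width 8, 3 decimals -> '   0.667'
s2 = "{:.3f}".format(x)           # no width given      -> '0.667'
s3 = "{:10.2e}".format(x)         # scientific notation -> '  6.67e-01'
s4 = "{:5d}".format(42)           # integer, width 5    -> '   42'
s5 = "{:3.1f}".format(1234.5678)  # wider than T=3, not truncated -> '1234.6'

for s in (s1, s2, s3, s4, s5):
    print("|" + s + "|")
```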

**Exercise**:  Given `a = 123`, `b = 3.14159`, and `c = 0.1234567`, produce 

`s = "  123|3.14|1.234e-01"`.

### Numbered Formats

Formats can also be numbered:

In [64]:
print("{0} {1:.2f} {2:.2e}".format(1, 2, 3))

1 2.00 3.00e+00


In [65]:
print("{0} {2:.2e} {1:.2f}".format(1, 2, 3))

1 3.00e+00 2.00


In [66]:
print("{0} was scared of {1} because {1} {2} {3}".format(6, 7, 8, 9))

6 was scared of 7 because 7 8 9


### Writing `t`, `v`, and `a` 

In [67]:
f = open('new_data.txt', 'w')
# write the header information (remember the newline \n!)
f.write('time (s)   vel (m/s)  acc (m/s**2)\n') 
for i in range(len(t)):
    # produce each line of text (again, remember \n!)
    s = "{:.8f} {:.8f} {:.8f}\n".format(t[i], v[i], a[i])
    f.write(s)
f.close()
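A quick way to check the result is to read the file back.  The sketch below is a self-contained variant under a few assumptions: it uses short plain lists in place of the NumPy arrays, a hypothetical file name `demo_new_data.txt`, and `zip` instead of indexing with `range(len(t))`.

```python
# Stand-in values (first three rows of data.txt); in the lecture these are NumPy arrays.
t = [0.0, 0.22222222, 0.44444444]
v = [1.0, 1.24884887, 1.55962350]
a = [0.0, 0.01097394, 0.08779150]

with open('demo_new_data.txt', 'w') as f:
    f.write('time (s)   vel (m/s)  acc (m/s**2)\n')
    # zip walks the three sequences together, so no index is needed.
    for ti, vi, ai in zip(t, v, a):
        f.write("{:.8f} {:.8f} {:.8f}\n".format(ti, vi, ai))

# Read the file back to confirm what was written.
with open('demo_new_data.txt', 'r') as f:
    written = f.readlines()

print(len(written))  # 4: the header plus three data lines
print(written[1])
```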

**Exercise**:  Write a short program that for any integer `n` will produce a file with `n` lines, and each line must print the value `x = 0.1` with `n` digits to the right of the decimal point.  (Hint: format the format).

## Recap

By now, you should be able to

- Open and process text files.
- Use string functions to parse data into desired formats.
- Convert data into desired string formats.
- Write strings to text files.

**Reminder**: Exam on Friday during normal lab session.  No notes, calculator, internet resources, etc. 