# Files

In this notebook we'll look at getting data into and out of files:

- Writing to a text file.
- Reading from a text file.
- Reading from lots of text files and reorganizing data.

We won't talk about binary files.

**In general, for reading most files, you will first look for a library that provides a file reader for your file type.**


## Writing files

Let's make a text file!

We will need a way to refer to the file, using a **path**. For that, we can use `pathlib`. 

In [None]:
import pathlib

file = pathlib.Path('./myfile.txt')

file

This path object knows about the file's name structure, and where the file is located on disk:

In [None]:
file.stem, file.suffix, file.resolve()

We can write to the file like so:

In [None]:
with file.open(mode='wt') as f:
    f.write('My text data.')

There's also a shortcut method:

In [None]:
file.write_text("My text data.")

<div style="background: #e0ffe0; border: solid 2px #d0f0d0; border-radius:3px; padding: 1em; color: darkgreen">

<h3>Exercise</h3>

Write this small CSV to a file. <a title="Write a loop. You need to separate the lines with a newline character at the end of each line.">HINT.</a>

```
lines = ["depth,gr,rhob", "1022.1,78,2019", "1023.0,84,2034", "1023.9,89,2045"]
```
</div>

In [None]:
file = pathlib.Path('./mydata.csv')

lines = ["depth,gr,rhob", "1022.1,78,2019", "1023.0,84,2034", "1023.9,89,2045"]

# YOUR CODE HERE



### File modes

You have to decide what you want to do with the file.

- **`r`** &mdash; read only (default)
- **`r+`** &mdash; read and write (pointer at 0 &mdash; careful to manage the pointer!)
- **`x`** &mdash; open for exclusive creation, failing if the file already exists (add `+` for read and write)
- **`w`** &mdash; write a new file, **clobbering the existing file if there is one**
- **`a`** &mdash; append to an existing file (creating it if it doesn't exist)

You can also add another letter to indicate whether you're handling text or bytes:

- **`t`** &mdash; text (default)
- **`b`** &mdash; bytes

For example, to open an existing text file for appending data to the end:

    with open(fname, 'at') as f:
        f.write('New data')
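
To see how the modes in the list above differ in practice, here is a minimal sketch that applies `x`, `a` and `w` to the same file in turn. The filename `modes_demo.txt` is made up for illustration, and everything happens in a temporary directory so nothing real gets clobbered.

```python
import pathlib
import tempfile

# Work in a throwaway directory so we don't touch any real files.
with tempfile.TemporaryDirectory() as tmp:
    demo = pathlib.Path(tmp) / 'modes_demo.txt'  # Hypothetical filename.

    # 'x' creates the file, but only if it doesn't exist yet.
    with demo.open(mode='xt') as f:
        f.write('first line\n')

    # A second 'x' fails rather than overwriting.
    try:
        demo.open(mode='xt')
        second_x_failed = False
    except FileExistsError:
        second_x_failed = True

    # 'a' adds to the end of the existing content.
    with demo.open(mode='at') as f:
        f.write('second line\n')
    after_append = demo.read_text()

    # 'w' throws away the old content before writing.
    with demo.open(mode='wt') as f:
        f.write('replaced\n')
    after_write = demo.read_text()

print(second_x_failed)
print(repr(after_append))
print(repr(after_write))
```

The `x` mode is a good default when overwriting would be a disaster, since you get an error instead of silent data loss.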

---

## Read a text file

There is a convenient way to read all the data in a small file:

In [None]:
file.read_text()

If the file is very large, we might not want it all in memory. We can read it line by line:

In [None]:
with file.open(mode='rt') as f:
    line1 = f.readline()
    line2 = f.readline()  # Note the pointer has moved.
    rest_of_lines = f.readlines()

line1, line2, rest_of_lines

We can also iterate over the lines to process each in turn:

In [None]:
with file.open(mode='rt') as f:
    for line in f:
        print(line)

<div style="background: #e0ffe0; border: solid 2px #d0f0d0; border-radius:3px; padding: 1em; color: darkgreen">

❓ Why does it look like there are blank lines in the file?
</div>

### `pathlib.Path().open()` vs `open()`

You will probably see tutorials etc that do it this way instead:

```python
path = './myfile.txt'

# Write data.
with open(path, 'w') as f:
    f.write('My text data.')

# Read data.
with open(path, 'r') as f:
    data = f.read()
```

And this is perfectly fine. But `pathlib` is incredibly useful, especially if you want to make things that run on different platforms (eg Linux and Windows). And it offers "one obvious way" to do things with files.
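
As a small illustration of the cross-platform point, the sketch below builds the same path in both POSIX and Windows flavours. It uses the 'pure' path classes, which only manipulate names and never touch the disk, so it runs the same on any machine. The folder names are invented for the example.

```python
from pathlib import PurePosixPath, PureWindowsPath

parts = ('data', 'wells', 'myfile.txt')  # Hypothetical folder and file names.

# Same components, different platform conventions.
posix = PurePosixPath(*parts)
windows = PureWindowsPath(*parts)

print(posix)    # Uses forward slashes.
print(windows)  # Uses backslashes.

# The / operator joins components without hand-written separators.
sibling = posix.parent / 'other.csv'
print(sibling)

# Name-related attributes work the same way on both flavours.
print(windows.name, windows.stem, windows.suffix)
```

In normal use you would just write `pathlib.Path(...)`, which picks the right flavour for the machine the code is running on.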

## A sneak peek at Pandas

Let's read our CSV file with Pandas!

## Read a zip file

The data for these exercises were generated from the following USGS dataset: https://pubs.usgs.gov/dds/dds-033/USGS_3D/ssx_txt/3dstart.htm

We're going to read data from some specially formatted TOPS files. Let's download the zip file and list its contents:

In [None]:
import requests
import io
from zipfile import ZipFile

url = "https://github.com/scienxlab/datasets/raw/refs/heads/main/usgs/sussex.zip"

r = requests.get(url)
file_like = io.BytesIO(r.content)

with ZipFile(file_like) as zf:
    for file in zf.namelist():
        print(file)

We can also read a single file from the zip file:

In [None]:
with ZipFile(file_like) as zf:
    with zf.open('UWI_4900527320.tops', mode='r') as f:

        # Print the name of the file.
        print(f.name)
        print(len(f.name) * '=')

        # Print the contents.
        print(f.read().decode())
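
Files opened from a zip archive give you bytes, which is why the cell above calls `.decode()`. If you would rather loop over lines of text, one option is to wrap the file object in `io.TextIOWrapper`. The sketch below builds a tiny archive in memory so that it runs without a download. The member name `example.txt` and its contents are made up, and the encoding is assumed to be UTF-8.

```python
import io
from zipfile import ZipFile

# Build a tiny zip archive in memory; the name and contents are invented.
buffer = io.BytesIO()
with ZipFile(buffer, mode='w') as zf:
    zf.writestr('example.txt', 'first line\nsecond line\n')

# Read it back, decoding line by line instead of all at once.
lines = []
with ZipFile(buffer) as zf:
    with zf.open('example.txt') as f:
        for line in io.TextIOWrapper(f, encoding='utf-8'):
            lines.append(line.rstrip('\n'))

print(lines)
```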

This is cool, and we could continue to read from the zip file like this, but I think the next part will be easier to understand if we extract the files:

In [None]:
with ZipFile(file_like) as zf:
    for file in zf.namelist():
        zf.extract(file, path='./sussex')

### Globbing

We can visit the files in a directory with ['globbing'](https://en.wikipedia.org/wiki/Glob_(programming)) patterns:

In [None]:
path = pathlib.Path('./sussex')
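
Here is a minimal, self-contained sketch of globbing with `Path.glob()`. It creates a few empty files in a temporary directory and then matches them with patterns. The filenames are invented, chosen only to resemble the ones in this dataset.

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    folder = pathlib.Path(tmp)

    # Create a few empty files to search; these names are made up.
    for name in ['UWI_001.tops', 'UWI_002.tops', 'notes.txt']:
        (folder / name).touch()

    # * matches any run of characters within a filename.
    tops = sorted(p.name for p in folder.glob('*.tops'))
    uwi = sorted(p.name for p in folder.glob('UWI_*'))

    # For comparison, iterdir() visits everything in the directory.
    everything = sorted(p.name for p in folder.iterdir())

print(tops)
print(uwi)
print(everything)
```

Assuming the extracted files keep their `.tops` extension, the same idea should work here with something like `path.glob('*.tops')`. Note that `glob()` returns a generator in no guaranteed order, hence the `sorted()`.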



<div style="background: #e0ffe0; border: solid 2px #d0f0d0; border-radius:3px; padding: 1em; color: darkgreen">

<h3>Exercise</h3>

- Write a `for` loop to read the lines of `UWI_4900527320.tops` one by one. <a title="Use either f.readlines() OR for line in f:">**HINT**</a>
- Find the line containing `UNITS:` and print the units abbreviation from it. <a title="Use the keyword 'in' to test for the substring 'UNITS:' then use the string method split(':') on the line to break it into parts.">**HINT**</a>
- You can add `break` after your print statement, as Python can stop reading the file at that point.
</div>

In [None]:
file = pathlib.Path('./sussex/UWI_4900527320.tops')

# YOUR CODE HERE



<div style="background: #e0ffe0; border: solid 2px #d0f0d0; border-radius:3px; padding: 1em; color: darkgreen">

<h3>Exercise</h3>

Now retrieve the units **and** the depth of the top of the Cody formation, eg from a line like `"Cody,2470.10"`. <a title="Use `in` and `str.split()` as before. Remember to cast the value as a float.">**HINT**</a>

Print the depth value and the units.
</div>

In [None]:
# YOUR CODE HERE



### A quick word about units

Don't try to mess about with units in real life; just use [`pint`](https://pint.readthedocs.io/en/0.10.1/tutorial.html).

In [None]:
from pint import UnitRegistry

ureg = UnitRegistry()
Q_ = ureg.Quantity

# Parse a string:
quantity = Q_('2,470.1 m')
quantity

<div style="background: #e0ffe0; border: solid 2px #d0f0d0; border-radius:3px; padding: 1em; color: darkgreen">

<h3>Exercise</h3>

Turn your code into a function that takes the filename as an argument. Return the value and the units.

Don't forget the docstring!
</div>

In [None]:
def get_top_cody(file):

    # YOUR CODE HERE



### Exercise

There are several formations in these files: 

- Ardmore
- Cody
- Niobrara
- Sussex Upper Top
- Sussex Upper Base
- Sussex Lower Top
- Sussex Lower Base
    
Modify your function to extract the top of any formation.

Before changing any code, ask yourself: what is the smallest possible change you can make?

In [None]:
def get_top(file, formation):

    # YOUR CODE HERE



<div style="background: #e0ffe0; border: solid 2px #d0f0d0; border-radius:3px; padding: 1em; color: darkgreen">

<h3>Continue exploring</h3>

Reminder: we don't really need to read these files, because we already have access to this data in a more useful format.

But! You might like to continue improving the function. There is still lots to do:

- Read all the files starting with `UWI_` and print the depth to top Niobrara for each one.
- The units vary. Write a function to convert them all to 'ft' or 'm', and do the conversion. (You could also add the option to convert to the `get_top()` function.)
- The depths are relative to the ground level (GL). Read the GL value from the file and use it to correct the formation depth to elevation.
- Read the location from the file and use these data to make a map, eg of the depth to the top Cody.
- Instead of reading just one formation, you could iterate over the list of formations to get them all. You might have to deal with one or more being missing.
- Or, you could read everything in the `Formations` section of the file. Then you don't need to know the list of formations in advance.
- The files without `UWI_` at the start of the filename will need slightly different treatment (different comment character, different delimiter). Can you adapt your function to deal with all the files?

</div>

<hr />
<small>
<p>Copyright 2025 Matt Hall (Equinor)</p>

<p>Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:</p>

<p>The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.</p>

<p>THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.</p>
</small>