# Announcements

* Next Tuesday will be the last graded tutorial!  Lectures will continue, with tutorial time going towards final projects.

# Working with files and data

<a href="https://www.theatlantic.com/technology/archive/2012/10/five-coolest-things-we-learned-googles-data-center-tour/322419/" target="_blank"><img src="https://raw.githubusercontent.com/wlough/CU-Phys2600-Fall2025/main/lectures/img/hard-drives.jpg" width=600px /></a>
                                                                                                                                                      
## PHYS 2600: Scientific Computing

## Lecture 24

In [None]:
import numpy as np

## File systems

<a href="https://www.ibm.com/support/knowledgecenter/zosbasics/com.ibm.zos.zconcepts/zconcepts_177.htm" target="_blank"><img src="https://raw.githubusercontent.com/wlough/CU-Phys2600-Fall2025/main/lectures/img/file-hierarchy.gif" width=400px style="float:right;" /></a>

First, some words about how computers store information.  All modern computers use some variation of a __hierarchical file system__, which consists of __files__ (discrete chunks of information) and __directories__ (which contain files or directories.)  

A single Jupyter notebook is a file; a directory is sort of like a Python list (an object that organizes references to other objects.)

Your files on Colab (just like the files on your own computer) are stored in a file system of this kind.



Like Python objects, files are big chunks of binary information - we need to know _what kind of information_ to read it correctly!  In Python, this extra information is the "type".  Files don't have types, but a common convention is to use a __file extension__ that signals what a file contains.

Extensions are separated with a dot:

`<name>.<extension>`

For example, `info.txt` is a file called "info" with the `txt` extension, which indicates that it contains _plain text_.  Two file extensions are commonly associated with Python:

* `.ipynb` is the extension for __Jupyter notebooks__ (the exact spelling comes from the old name of the project, "IPython notebook".)
* `.py` is the extension for __pure Python code__.  This is the sort of file that exists in a module, or what you would use for a stand-alone Python script.  (You will write a module in the final project. More on modules next time!)
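As a small sketch of the extension convention, the standard library function `os.path.splitext` splits a file name into its name and its extension. The file names below are made up for illustration.

```python
import os

# Hypothetical file names, used only to illustrate the convention
filenames = ["info.txt", "lecture24.ipynb", "my_module.py", "README"]

for filename in filenames:
    # splitext splits at the last dot; the extension keeps its leading dot
    name, extension = os.path.splitext(filename)
    print(repr(filename), "-> name:", repr(name), "extension:", repr(extension))

# Keep the (name, extension) pairs around for inspection
split_results = [os.path.splitext(f) for f in filenames]
```

Note that a file with no dot in its name simply has an empty extension. The extension is only a label: renaming a file does not change what it contains.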

## File I/O

Reading and writing files in computing goes under the generic name of __file input/output__, or __file I/O__.  For many problems in scientific computing, file I/O is essential; large amounts of data are better stored in files than in notebooks (and many data files are produced directly by measurement equipment.)

<img src="https://raw.githubusercontent.com/wlough/CU-Phys2600-Fall2025/main/lectures/img/folders_paths.png" width=400px style="float:right;"/>

Working with any file from Python requires knowing the __file path__: where the file is located, usually given relative to the current working directory (for a notebook, this is normally the folder containing the notebook).  In the picture to the right, the data file paths we would use in `our_notebook.ipynb` are:

* `'main_data.csv'`
* `'subfolder/sub_data.csv'`
* `'../outer_data.csv'`

Quick summary: a forward slash `/` separates folder names, and `../` means "go back up one folder level".  This is the Mac/UNIX path convention; Windows [works a bit differently](https://medium.com/@ageitgey/python-3-quick-tip-the-easy-way-to-deal-with-file-paths-on-windows-mac-and-linux-11a072b58d5f).

Here's a [quick little tutorial](https://www.kirupa.com/html5/all_about_file_paths.htm) on paths for an alternative explanation.
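To see how relative paths are interpreted, here is a minimal sketch using the standard `os` module. It reuses the three example paths from the list above; the files do not need to exist, since we only manipulate the path strings.

```python
import os

# Relative paths are interpreted starting from the current working directory
cwd = os.getcwd()
print("Current working directory:", cwd)

# The three relative paths from the picture (the files need not exist)
relative_paths = ["main_data.csv", "subfolder/sub_data.csv", "../outer_data.csv"]

for rel in relative_paths:
    # abspath prepends the working directory and resolves any '..'
    print(rel, "->", os.path.abspath(rel))

# os.path.join inserts the correct separator for the current operating system
joined = os.path.join("subfolder", "sub_data.csv")
print(joined)
```

Building paths with `os.path.join` rather than typing the separator by hand is one way to keep code portable between Mac/UNIX and Windows.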

In [None]:
import os
import urllib.request

remote_dir = "https://raw.githubusercontent.com/wlough/CU-Phys2600-Fall2025/main/lectures/"
local_dir = "./" # or whatever directory you want the file to be in. "./"="current working directory"
filename = "example_data_1.lsv"
remote_url = remote_dir + filename
local_path = os.path.join(local_dir, filename)
os.makedirs(local_dir, exist_ok=True) # Ensure local directory exists. Redundant for current working directory.
if not os.path.exists(local_path):
    urllib.request.urlretrieve(remote_url, local_path)


In scientific computing, __data files__ are ubiquitous - they contain raw data of some kind.  For most scientific purposes, __human-readable__ file formats are desirable (i.e., the data should appear as plain text and numbers if the file is opened in a text editor.)  For data sets large enough that plain-text files become unwieldy, compressed binary formats are often used instead.

We'll start with a very simple example, "line-separated value" (__LSV__) format.  In this format, every line of the file contains a single piece of data:
```
5.4
33.7
-2.2
1.9
(...)
```

I've created a file called `example_data_1.lsv` that contains some numbers in this format.


To read the file in Python, we can use the `open` built-in command:

In [None]:
data_file = open('example_data_1.lsv')
data_file

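This gives us a file object which represents an __input stream__ from the file.  This is somewhat like a phone call: the connection stays open until we hang up, and while it is open we can read from the file (or write to it, if we opened it in a writing mode).  Depending on the operating system, other programs may be unable to modify the file while we have it open.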

Now let's read from it:

In [None]:
data_file.readline()

Note that `.readline()` remembers the _state_ of the file connection: if we call it repeatedly, we'll always get the next line we haven't read yet.

Reading single lines at once is useful for really large files that won't fit in RAM all at once, but for most purposes, we're better off using `.readlines()` to get them all at once:

In [None]:
lines = data_file.readlines()
lines

Note that this remembers state, just like `.readline()` - if we call it again, we get an empty list!  To read the file again, we have to call the `open()` function again to get a new file connection.
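The following sketch illustrates how the file connection keeps track of its position. To stay self-contained it first writes a small stand-in file (`state_demo.lsv`, a made-up name with made-up values) into a temporary directory, rather than relying on the downloaded example file.

```python
import os
import tempfile

# Setup: create a tiny stand-in data file in a temporary directory
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "state_demo.lsv")
setup_file = open(path, "w")
setup_file.write("5.4\n33.7\n-2.2\n")
setup_file.close()

data_file = open(path)
first_line = data_file.readline()   # reads line 1
remaining = data_file.readlines()   # reads whatever is left: lines 2 and 3
again = data_file.readlines()       # nothing left to read: an empty list
data_file.close()

# A fresh call to open() starts again from the beginning of the file
data_file = open(path)
all_lines = data_file.readlines()
data_file.close()

print(repr(first_line), remaining, again, all_lines)
```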

When we're done with a file, it's best practice to _close the connection_ with `.close()`:

In [None]:
data_file.close()
# data_file.readline()  # Uncommenting this line raises a ValueError, since the file is closed

Python will automatically close file connections for us in many cases, but if you don't do it yourself it may stay open for a while.  This can lead to problems such as:
- __Blocking__ (other programs might be unable to access the file while our connection is open); or 
- __File corruption__ (bad/unintended data getting written to a file connection.)

Best practice is to use a Python keyword that is brand-new to us, `with`!  The `with` keyword creates a block, much like `if`, `for`, etc.  A `with` block is known as a __context__.

In [None]:
with open('example_data_1.lsv') as data_file:
    lines = data_file.readlines()

print(lines[0])
#data_file.readline()  # This will raise "ValueError: I/O operation on closed file"!

Inside the `with` block, `data_file` can be used as we did above, but as soon as the block ends (for any reason, including errors!) the file is closed __automatically__.  

Note that `with` is a very specialized keyword; it only works with Python objects that define context-based behavior, known as __context managers__.  (These are mostly things related to files or other active data connections.)
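As a rough sketch of the "for any reason, including errors" behavior, the example below deliberately triggers an error inside a `with` block and then checks the file object's `.closed` attribute. The file name and its contents are invented for the demonstration. It also loops over the file object directly, which yields one line at a time.

```python
import os
import tempfile

# Setup: a small stand-in file whose second line is not a valid number
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "context_demo.lsv")
with open(path, "w") as setup_file:
    setup_file.write("5.4\nnot a number\n")

try:
    with open(path) as data_file:
        # Looping over a file object yields one line at a time
        for line in data_file:
            value = float(line)  # fails on the second line
except ValueError as err:
    print("Conversion failed:", err)

# The with block closed the file even though an error interrupted it
file_was_closed = data_file.closed
print("File closed?", file_was_closed)
```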

What about __writing data__ out to a file?  For specialized file formats, there will usually be dedicated methods available for writing as well as reading.  In general, we just use the _mode_ argument to the `open()` built-in function:

In [None]:
with open('a_new_file.txt', 'w') as new_file:
    new_file.write('Hello, \n')
    new_file.write('world!')

with open('a_new_file.txt', 'r') as new_file:
    print(new_file.readlines())

Notice that unlike `print()`, the `write()` command __does not add newlines automatically__ - we need to use the `\n` character explicitly to get line breaks.
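Putting this together, here is a minimal sketch of writing a list of numbers out in the line-separated format from earlier. The values and the file name `output_data.lsv` are made up, and the file goes into a temporary directory so nothing in your working folder is touched.

```python
import os
import tempfile

values = [5.4, 33.7, -2.2, 1.9]  # made-up example data

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "output_data.lsv")

with open(path, "w") as out_file:
    for value in values:
        # write() needs a string, and we must add the newline ourselves
        out_file.write(str(value) + "\n")

# Read the file back to check what was written
with open(path) as in_file:
    written_lines = in_file.readlines()

print(written_lines)
```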

There are [several different modes available](https://docs.python.org/3/library/functions.html#open) for `open()`.  The most common are `'r'` (read from file, default), `'w'` (write to file, removing any existing data), and `'a'` (write to file, appending to the end of existing data.)  If we used `'a'` instead of `'w'` above, we would get another copy of `'Hello, world!'` added to the file every time we run the cell.
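The difference between `'w'` and `'a'` can be checked with a short experiment. This sketch writes the same line twice in each mode to a throwaway file (`mode_demo.txt`, a hypothetical name) and counts the lines afterwards.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "mode_demo.txt")

# Opening in 'w' mode twice: the second open erases what the first wrote
for _ in range(2):
    with open(path, "w") as f:
        f.write("Hello, world!\n")
with open(path) as f:
    after_w = f.readlines()

# Opening in 'a' mode twice: each write is added to the end of the file
for _ in range(2):
    with open(path, "a") as f:
        f.write("Hello, world!\n")
with open(path) as f:
    after_a = f.readlines()

print(len(after_w), "line after two 'w' writes")
print(len(after_a), "lines after two more 'a' writes")
```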

## Data wrangling 

Getting the data from the file is only the beginning!  If we actually want to work with it, there is usually extra work to do.  Let's look at the data we read in from `example_data_1.lsv`:

In [None]:
print(lines[0:3])

Since the data file is text-formatted, we get the data as _strings_ instead of numbers.  Moreover, they all have the `\n` symbol at the end.  You may remember from way back that `\n` is the __new-line__ character, one of a set of special characters denoted with the __escape character__ `\`.  As we saw before, we can enter these manually into strings:

In [None]:
print('Line 1\nLine 2')

The only other special character you're likely to encounter in data files is the tab character, `\t`, but [here's a more complete list](https://docs.python.org/2.0/ref/strings.html) just in case.

To work with the actual numbers, we need to get rid of `\n`!  The quickest way to do this is with the `.strip()` string method:

In [None]:
lines[0].strip()

`.strip()` automatically removes _all whitespace_ (spaces, tabs, and newlines such as `\n`) from the _beginning and end of a string_.  (It can also be used to remove any other characters we want from the ends of a string - see the documentation.)  Let's process our data into a numeric list:

In [None]:
proc_data = []
for raw_data in lines:
    proc_data.append(float(raw_data.strip()))
    
print(proc_data)

This is an example of __data wrangling__: taking raw input data and changing its formatting, to get it ready for further analysis.
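Real data files are often a little messier than our example. The sketch below uses a made-up list of raw lines (with stray spaces, a blank line, and a missing final newline) to show a few variants of `.strip()`, and a more compact way to write the same cleaning loop as a list comprehension.

```python
# Variants of strip() on a single messy string
raw = "  \t5.4\n"
stripped = raw.strip()          # whitespace removed from both ends
right_only = raw.rstrip()       # whitespace removed from the right end only
left_only = raw.lstrip()        # whitespace removed from the left end only
custom = "--5.4--".strip("-")   # remove a chosen character from both ends

# Made-up raw lines, as readlines() might return them from a messy file
raw_lines = ["5.4\n", "33.7\n", "\n", "  -2.2 \n", "1.9"]

# Strip each line, skip lines that are empty after stripping, convert to float
proc_data = [float(line.strip()) for line in raw_lines if line.strip()]

print(proc_data)
```

Skipping blank lines is a design choice that suits this format; for other data sets a blank line might signal a real problem that should raise an error instead.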

## Other file formats

There are many other common data formats in widespread use:
* Spreadsheet formats (`.xlsx`, `.xls`, `.ods`, although most spreadsheet programs also read/write CSV)
* XML
* JSON
* YAML

For any of these formats, there's a Python module specifically devoted to reading/writing it (`openpyxl` and `xlutils`, `xml`, `json`, `pyyaml`, which is imported under the name `yaml`).  As always, you don't need to know anything about these modules until the time comes that you need to deal with that kind of data file!
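As one example, the built-in `json` module can save a dictionary to a human-readable text file and load it back. This is only a sketch: the dictionary contents and the file name `measurement.json` are invented for illustration.

```python
import json
import os
import tempfile

# Made-up measurement record: JSON handles dicts, lists, strings and numbers
measurement = {"label": "trial 1", "units": "cm", "values": [5.4, 33.7, -2.2, 1.9]}

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "measurement.json")

# json.dump writes text, so the file is opened in ordinary 'w' mode
with open(path, "w") as f:
    json.dump(measurement, f, indent=2)

# json.load parses the text back into Python objects
with open(path) as f:
    restored = json.load(f)

print(restored["values"])
```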

Other powerful modules for dealing with data can read many of these types natively - one of the most popular is called `pandas`, which is based around powerful data manipulations using a tabular data type called a "data frame".  (`pandas` is meant to be familiar to users of the `R` programming language.)

Beyond structured text formats, there are a near-infinite number of __binary file formats__, where the data stored is just raw binary data.  For a human-readable file you can guess the format by looking at the data, but for raw binary this is basically impossible!

<img src="https://raw.githubusercontent.com/wlough/CU-Phys2600-Fall2025/main/lectures/img/hex-edit.jpg" width=300px style="float:right;" />

Binary formats are common for things like images, audio recordings, and movies, where compression is important (and looking at a text-based representation wouldn't really help anyway.)  They are also used for storing complex data only meant to be used by a single program.

In fact, Python itself has a built-in module called `pickle`, which allows us to read and write binary representations of arbitrarily complex Python objects!

There are uses for `pickle`, but __it should be avoided for saving and sharing scientific data__ - without having the same version of Python with the same modules, restoring a `pickle` object may not actually work!  (There are also concerns about security, since `pickle`s can be used to execute arbitrary code.)
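For completeness, here is a minimal sketch of what a `pickle` round trip looks like, using a made-up record and a temporary file. Note the binary modes `'wb'` and `'rb'`, and keep the warning above in mind: only load pickle files that you created yourself or that come from a source you trust.

```python
import os
import pickle
import tempfile

# Made-up record; unlike JSON, pickle preserves Python-specific types like tuples
record = {"label": "trial 1", "values": [5.4, 33.7, -2.2, 1.9], "shape": (2, 2)}

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "record.pkl")

# Pickle files are binary, so we open with 'wb' (write binary)
with open(path, "wb") as f:
    pickle.dump(record, f)

# ...and read them back with 'rb' (read binary)
with open(path, "rb") as f:
    restored = pickle.load(f)

# Looking at the raw bytes shows that the file is not human-readable
with open(path, "rb") as f:
    raw_bytes = f.read()

print(restored == record)
print(raw_bytes[:20])
```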

## Tutorial 24

Go ahead and open the tutorial!