# Tutorial 24: Data files and I/O

## PHYS 2600

In [None]:
# Run this cell to import packages and download the files used in the tutorial
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
import os
import urllib.request

remote_dir = "https://raw.githubusercontent.com/wlough/CU-Phys2600-Fall2025/main/tutorials/tut24/tut24_data/"
local_dir = "tut24_data" # equivalent to "./tut24_data"
filenames = ["example_data_1.lsv", "example_data_tut24.csv", "temp_data.lsv", "weather_data_boulder_0918.csv"]

# Ensure local directory exists
os.makedirs(local_dir, exist_ok=True)

for filename in filenames:
    remote_url = remote_dir + filename
    local_path = os.path.join(local_dir, filename)
    # Download the file if missing
    if not os.path.exists(local_path):
        urllib.request.urlretrieve(remote_url, local_path)


## T24.1 - Basic file I/O 

### Part A

Let's start with some basic file parsing.  We'll begin with the same line-separated value file `example_data_1.lsv` that I used in lecture.  First, I've given you the code I wrote in lecture to read the file and process the data into a numerical list:  __run the cell below__ to do this.

(Note that I put this data file, and all the others you'll be using below, into a directory called `tut24_data`.  To access files in the directory, we use path notation with a `/` after the directory name, as you can see below.)



In [None]:
with open('tut24_data/example_data_1.lsv') as data_file:
    lines = data_file.readlines()

proc_data = []
for raw_data in lines:
    proc_data.append(float(raw_data.strip()))
    
print(proc_data)

Now, your job is to reverse the process!  In the cell below, write some code which will __write a new LSV file called `tut24_1A.lsv`__ containing the numbers in `proc_data`.  If you do this correctly, the file you produce should look _identical_ to the original `example_data_1.lsv` file.  (Don't forget to include the newline `\n` characters when you're writing out!)

_(Reminder: to write instead of reading, use the syntax `open(file_name, 'w')`.)_
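As a minimal sketch of the write pattern (using made-up numbers and a hypothetical file name `demo.lsv` in a temporary directory, not the tutorial data), writing one value per line might look like this:

```python
import os
import tempfile

values = [1.5, -2.0, 3.25]  # made-up numbers, not the tutorial data

# Hypothetical throwaway location, so nothing in the tutorial folder is touched
demo_path = os.path.join(tempfile.mkdtemp(), "demo.lsv")

with open(demo_path, "w") as out_file:
    for value in values:
        # .write() does not add a newline for us, so we append '\n' ourselves
        out_file.write(str(value) + "\n")

# Read the file back as one string to see exactly what was written
with open(demo_path) as in_file:
    demo_text = in_file.read()

print(repr(demo_text))
```

Printing with `repr()` makes the `\n` characters visible, which is a handy way to check that every line ends the way you expect.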

In [None]:
# 

In [None]:
## Testing cell - read the new file back as LSV data, make sure it matches proc_data
with open('tut24_1A.lsv') as data_file:
    test_lines = data_file.readlines()

test_data = []
for raw_data in test_lines:
    test_data.append(float(raw_data.strip()))
    
print(test_data)

import numpy.testing as npt
npt.assert_allclose(test_data, proc_data)

### Part B

Next, we'll deal with a couple of complications.  The file `temp_data.lsv` in the `tut24_data/` folder contains some more numbers, corresponding to a sequence of temperature measurements.  If you open the file, you'll see that the first row is a __header__: rather than a number, it is a string which describes the following data.  (This is an example of __metadata__ - data that exists to tell us about other data.)

Headers are useful to have in data files - they tell us important information about the data.  But since they're not the same as the rest of the data, we have to discard them when we parse the rest of the data!

In the cell below, __read the file `temp_data.lsv`__ to produce a list called __`temp_data`__.  This list should be a list of floats, like `proc_data` above - which means it should _not_ include the header.

_(Hint: remember, an open file object keeps track of its position, so neither `.readline()` nor `.readlines()` will read the same line twice.)_
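To see this position-tracking behavior without touching any files, here is a small sketch that uses `io.StringIO` as a stand-in for an open file.  The contents (a "Voltage" header and three numbers) are invented for illustration:

```python
import io

# A stand-in for an open file; the contents are invented for illustration
fake_file = io.StringIO("Voltage (V)\n1.0\n2.5\n4.0\n")

header = fake_file.readline()      # consumes only the first line
remaining = fake_file.readlines()  # picks up where .readline() stopped

print(repr(header))
print(remaining)
```

The header never shows up in `remaining`, because the first call already moved past it.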

In [None]:
# 


In [None]:
print(temp_data)

assert len(temp_data) == 7
npt.assert_allclose(temp_data, [34, 35, 40, 31, 25, 36, 42])


Now, repeat the exercise that you did in part A: write `temp_data` out to a file in LSV format.  But write it to the __same file__, `tut24_1A.lsv`, using the "append" mode so that the previous data remains.  If you do this right, you should see a file containing the 15 numbers from `example_data_1.lsv` followed by the 7 numbers from `temp_data.lsv`.

_(Reminder: to append, use the syntax `open(file_name, 'a')`.  If you make a mistake, you might need to re-run your code from part A to re-create the original file before you append to it here.)_
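The difference between the two modes can be sketched as follows, again with a hypothetical file `append_demo.lsv` in a temporary directory and made-up contents:

```python
import os
import tempfile

# Hypothetical throwaway file, so the tutorial files are left alone
demo_path = os.path.join(tempfile.mkdtemp(), "append_demo.lsv")

with open(demo_path, "w") as f:   # 'w' creates the file (or empties an existing one)
    f.write("1.0\n2.0\n")

with open(demo_path, "a") as f:   # 'a' keeps what is there and adds to the end
    f.write("3.0\n")

with open(demo_path) as f:
    after_append = f.readlines()

with open(demo_path, "w") as f:   # opening with 'w' again wipes the earlier lines
    f.write("9.0\n")

with open(demo_path) as f:
    after_overwrite = f.readlines()

print(after_append)
print(after_overwrite)
```

This is also why re-running an append cell several times keeps growing the file, while re-running a `'w'` cell resets it.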

In [None]:
# 

In [None]:
## Testing cell - read the final version of tut24_1A.lsv
with open('tut24_1A.lsv') as data_file:
    test_lines = data_file.readlines()

test_data = []
for raw_data in test_lines:
    test_data.append(float(raw_data.strip()))
    
print(test_data)
assert len(test_data) == 22

## T24.2 - Comma-separated value files


Line-separated value data is too simplistic; it's not used much in practice.  Instead, the most popular basic text-file format for storing data is __comma-separated values__, or "CSV".  CSV is a _tabular_ (two-dimensional) format, using lines to separate rows and commas to separate columns:

```
t,x,y,z
0.0,1.3,0.2,-0.4
0.2,2.4,0.7,-1.1
(...)
```

When parsing a CSV file, we still need to use `.strip()` to get rid of newline characters.  But we also need to split apart our data using the commas.  To accomplish that, we can use another useful string method called `.split()`.  Here's a demonstration:



In [None]:
csv_line = '0.0,1.3,0.2,-0.4\n'
print(csv_line.split(','))
print(csv_line.strip().split(','))  ## Strip first to get rid of the \n

### Part A

Let's get some practice with the CSV format.  I've provided a comma-separated value file called `example_data_tut24.csv`.

Since a CSV file is human-readable, a good place to start is to __open the file and look at its contents.__  You should see how many rows and columns to expect.  You'll also notice that this particular file contains a __header__, once again: the very first line of the file doesn't contain data, but instead a set of strings that describe the data columns below.

The cell below contains an example line of data from this CSV file, formatted as a string.  To get the numbers out as a list, you should carry out the following steps:

1. Use the `.strip()` string method to get rid of the newline character `\n`.
2. Use the `.split()` string method to divide the string into smaller strings.
3. Convert each string in the list to a floating-point number using the `float()` type-casting function.  (You'll need a `for` loop to run through the list.)

__Implement the function `parse_line_csv` below__ to carry out these three steps, then run the cell below to run it on `sample_line` and check that you parsed it correctly.

In [None]:
sample_line = '1,3.31,-0.27,7.79\n'

def parse_line_csv(line):
    #

In [None]:
parsed_line = parse_line_csv(sample_line)
print(parsed_line)

import numpy.testing as npt
npt.assert_allclose(parsed_line, [1.0, 3.31, -0.27, 7.79])

### Part B

Now you're ready to process the whole data file!  In the cell below, use the `with open(...) as ...` syntax to open the file `example_data_tut24.csv`, and then use `readline()` or `readlines()` and your function from part A to create a two-dimensional NumPy array containing the data.  __Save it to a variable called `proc_data`.__

_(Hint: don't forget the header line!  Since it contains metadata and not data, you should store that to a different variable or discard it entirely.)_

_(Another hint: it's easiest to make a list of lists, and then use `np.array()` to typecast at the end.  You could also allocate an array of zeroes and then fill it in, line by line, but that requires knowing the exact dimensions of the data before you start by looking at the file - not always practical!)_
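The list-of-lists approach can be sketched on a tiny invented table.  The column names and numbers below are assumptions for illustration, and `io.StringIO` stands in for a real file opened with `with open(...)`:

```python
import io
import numpy as np

# Invented three-column table with a header row (not the tutorial file)
fake_csv = io.StringIO("n,a,b\n1,0.5,2.0\n2,1.5,4.0\n3,2.5,6.0\n")

# Read the header first and keep it separate from the numerical data
header = fake_csv.readline().strip().split(",")

rows = []
for line in fake_csv:  # iteration continues from the line after the header
    fields = line.strip().split(",")
    rows.append([float(field) for field in fields])

table = np.array(rows)  # typecast the list of lists at the very end

print(header)
print(table.shape)
```

Because the rows are collected in a list, the code never needs to know in advance how many lines the file contains.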

In [None]:
data_filename = 'tut24_data/example_data_tut24.csv'

#

In [None]:
print(proc_data)
print(proc_data.shape)

import numpy.testing as npt
npt.assert_allclose(proc_data[9], [10, -4.63, 3.13, 7.47])
assert proc_data.shape == (12,4)

### Part C

Now our data is ready to use!  Let's apply a transformation.  Say we want to convert the middle two columns (x,y) to a single distance $d = \sqrt{x^2 + y^2}$ (discarding the final 'time' measurement column).  That's easy enough to do:

In [None]:
distance_data = np.zeros((12,2))
distance_data[:,0] = proc_data[:,0]
distance_data[:,1] = np.sqrt(proc_data[:,1]**2 + proc_data[:,2]**2)

print(distance_data)

Now we'd like to __write the transformed data out__ to a new file called `distances.csv`.  Use the `with open(..., 'w') as ...` context syntax to open `distances.csv` for writing, and then use the `.write()` file method to write lines to the file.  

__Use string formatting with the `g` format code__ for both numbers, and don't forget to include the newline `\n` at the end of every line!
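As a quick illustration of what the `g` format code does (with made-up numbers): it keeps up to six significant digits, drops trailing zeros, and switches to exponent notation for very large or very small values.

```python
a = 3.0
b = 2.718281828

# 'g' drops the trailing '.0' from a and rounds b to six significant digits
line = f"{a:g},{b:g}\n"
print(repr(line))

# A few more cases to show how 'g' behaves
half = format(0.5, "g")
big = format(1234567.0, "g")
print(half, big)
```

Note that `3.0` comes out as `3`, which is why a float column holding whole numbers looks like integers in the output file.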

In [None]:

#

In [None]:
with open('distances.csv', 'r') as dist_file:
    dlines = dist_file.readlines()

print(dlines)
assert len(dlines) == 12
assert dlines[4] == '5,4.61525\n'

## T24.3 - Data mining the weather (optional challenge)

_(Note: I'm including a very long optional challenge this time.  If you're not currently working with real data files, or planning to do so soon, then the two parts of this tutorial above are probably good enough for you to learn the basics of file I/O.  But real data comes with a lot of new challenges, so I'm hoping this part will be a valuable exercise for some of you.)_


Finally, let's try working with some real data!  The provided file `weather_data_boulder_0918.csv` contains [NOAA weather data](https://www.ncdc.noaa.gov/cdo-web/) for various weather stations in the vicinity of Boulder, taken over 28 days in the month of September 2018.  The columns, in order, are:

* Weather station ID
* Weather station name
* Date of observation
* Precipitation total (inches)
* Average temperature (degrees F)
* Max temperature (degrees F)
* Min temperature (degrees F)

Since this is a real dataset, we'll encounter many real-world data wrangling problems: "cleaning" out extraneous data we don't care about, transforming and combining data, and so on.

Once again, start by __opening up the raw data file and looking through it__ to get a sense for the raw data.  One line will look something like this:

"USS0005J42S","NIWOT, CO US","2018-09-14","0.00","57","72","41"

In fact, this line with all data filled in is _rare_: most of the reporting stations seem to measure only precipitation, so their temperature columns are missing entirely.  Of the stations reporting temperatures, most only record max/min and not the average.  Real data is often messy!


### Part A

Notice that this file is especially challenging to parse: this is a variation on CSV where each field is enclosed in double quotes `""` and _then_ the fields are separated by commas.  This is done so that commas can be used _inside_ a field, as in the station name above.

As a warm-up, I've included a single line of the data file as a string below.  __Extract the precipitation, max temperature, and min temperature__ from this string as a 3-entry NumPy array.


_(Hint: if you just pretend this is a regular CSV file and split on the commas, you'll end up breaking apart the station name field.  If you only want the precipitation and temperature data, you can simply ignore the name pieces, as long as you keep track of which column ends up where in the list after using `split`...)_

_(Another hint: the double quotes `"` will only get in your way here!  You can remove all the instances of a character from a string by using the `.replace()` method.  For example, `"hello world".replace('l', '')` will give you the string `'heo word'`.)_
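Here is a small sketch of both hints together, using an invented three-field line (the station ID `A1` and name `SPRINGFIELD, XX US` are hypothetical, not from the data file):

```python
# Invented quoted line with a comma inside the second field
demo_line = '"A1","SPRINGFIELD, XX US","7"\n'

# Strip the newline, remove every double quote, then split on commas
pieces = demo_line.strip().replace('"', '').split(',')
print(pieces)

# The name was broken into two pieces, so the numeric field
# sits at index 3 rather than index 2
value = float(pieces[3])
print(value)
```

The original line has three fields, but the naive split produces four pieces, which is exactly the index shift you need to keep track of.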

In [None]:
wline = '"USS0005J42S","NIWOT, CO US","2018-09-14","0.00","57","72","41"\n'

#

### Part B

Now parse the whole data file, and __create a 2-D NumPy array containing the precipitation, max temperature, and min temperature__ for __only a single Boulder station, ID `USW00094075`.__  There are two ways you can do this:

1. Pretend this is a regular CSV file, and parse it by splitting on the commas as you did for the single line in part A.
2. Use the `csv` module ([see the documentation here](https://docs.python.org/3/library/csv.html)) and a `csv.reader` to parse the dataset.  The `csv` module can recognize variations of CSV like this one and will deal with the quotes properly if you set the `quotechar` argument.
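For the second option, a minimal sketch of `csv.reader` on invented data might look like the following.  The header names and the single row are hypothetical, and `io.StringIO` stands in for a real file (the `csv` documentation recommends opening real files with `newline=''`):

```python
import csv
import io

# Invented quoted CSV text: a header row plus one data row
fake_file = io.StringIO('"ID","NAME","VALUE"\n"A1","SPRINGFIELD, XX US","7"\n')

# quotechar tells the reader which character wraps each field
reader = csv.reader(fake_file, quotechar='"')
rows = list(reader)

for row in rows:
    print(row)
```

Unlike a plain `.split(',')`, the reader keeps the comma inside the quoted name, so every row has the same number of fields.  The fields still come back as strings, so numerical columns need `float()` as before.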

In [None]:

weather_filename = 'tut24_data/weather_data_boulder_0918.csv'
boulder_data = []

#

print(boulder_data)
    
    

### Part C

Now extract the following quantities from your cleaned array of data in the cell below:

* Total precipitation;
* The lowest minimum temperature and highest maximum temperature;
* The average temperatures (obtained by averaging min/max temperature) on every Tuesday.

_(Hint: for the last one, use `np.mean()` and slicing to produce a 1-D array containing the average temperature, then one more slice to cut the array down to the four Tuesdays only.  The data runs from 9/2 to 9/29, so the first day is a Sunday and the last is a Saturday.)_
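The averaging-then-slicing idea can be sketched on made-up numbers: two invented weeks of max/min temperatures, assumed to start on a Sunday so that index 2 is the first Tuesday.

```python
import numpy as np

# Made-up data: 14 days, columns are [max temperature, min temperature]
demo_temps = np.column_stack([np.arange(70.0, 84.0), np.arange(40.0, 54.0)])

# Average across the two columns to get one value per day
daily_avg = np.mean(demo_temps, axis=1)

# Start at index 2 (Tuesday, if day 0 is a Sunday) and step by 7 days
tuesday_avg = daily_avg[2::7]

print(daily_avg.shape)
print(tuesday_avg)
```

The `start::step` slice picks out the same weekday from every week without any explicit loop.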

In [None]:
#

### Part D

Can you go back and report the lowest minimum and highest maximum temperature across _all weather stations_ in the file?  (This is tricky because many of the stations don't report any temperatures at all!)
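One possible way to cope with missing entries is sketched below.  It assumes that a missing temperature shows up as an empty string after parsing, which you should check against the actual file; the helper `to_float_or_nan` and the sample values are hypothetical.

```python
import numpy as np

def to_float_or_nan(text):
    """Hypothetical helper: convert a field to float, or NaN if it is empty."""
    text = text.strip()
    return float(text) if text else np.nan

# Made-up column of temperature fields, some of them missing
raw_fields = ["72", "", "65", "", "80"]
demo_column = np.array([to_float_or_nan(s) for s in raw_fields])

# nanmax / nanmin ignore the NaN placeholders
highest = np.nanmax(demo_column)
lowest = np.nanmin(demo_column)

print(highest, lowest)
```

Plain `np.max()` and `np.min()` would return `nan` here, so the NaN-aware versions (or filtering out the missing rows first) are needed.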

In [None]:
#