# **Lab 7 — Inputs and Outputs**
---

## Introduction

A proper understanding of how to get information into and out of your programs is essential for making them useful! This concept is often called "I/O" — short for "input/output" — and it encompasses reading in and writing files as well as printing information to the console for immediate viewing. In this lab, we will cover several types of I/O using Python's built-in `print()` function with string formatting as well as file I/O with NumPy and pandas.

**New this week:** For 636 students, your deliverable for this lab will be a ZIP file containing this notebook and an additional file for Deliverable 3, as requested below. Please rename your ZIP file to `<last_name>_lab_07.zip` prior to submission, then submit it to Canvas under the Lab 7 assignment. For 436 students, your submission remains the same as in previous weeks: just the renamed notebook.

## Resources

[Python output formatting](https://docs.python.org/3/tutorial/inputoutput.html)  
[NumPy I/O](https://numpy.org/doc/stable/user/basics.io.genfromtxt.html)  
[pandas I/O](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html)

## Exercise I: Python's `print()` function and string formatting

In many previous labs you have made use of the `print()` function to show the contents of variables. You've also displayed the values of variables or calculations by simply entering them on their own line at the end of a cell:

In [1]:
#notice that one line below is simply printed, and the other is marked as '[1]:'
x = 5
print(x)
x ** 2

5


25

You've probably noticed that this occurs only if the variable is on the **last** line of the cell. So this doesn't display anything:

In [3]:
x
y = 6

And this only displays the value of the variable placed on the **last** line:

In [4]:
x
y

6

In general, you want to use the `print()` function explicitly to reliably show information to the user. Showing variables by leaving them as the last line in a code cell is for helpful interactive display only — it isn't as robust and it doesn't work everywhere (once you switch from notebooks to scripts, you will have to use `print()` as nothing will be implicitly displayed).

Note that you do not need to provide a string to `print()`. For example, the following all work:

In [5]:
print(5)  # An integer

print(6.893)  # A float

print({'cat': 'liquid'})  # A dictionary

import numpy as np
print(np.array([1, 2, 3]))  # A NumPy array

5
6.893
{'cat': 'liquid'}
[1 2 3]


This works because Python internally converts these objects to strings before displaying them. For more control, however, you'll want to format things into strings before printing them. There are several ways to do this in Python; we'll focus on the one we've been using in previous labs and homeworks, "printf-style" string formatting. The documentation for this style of string formatting is [here](https://docs.python.org/3/library/stdtypes.html#old-string-formatting) and several examples are below.

In [6]:
print("My dog is %d years old." % 5) # Print an int with %d

print("My dog sheds %f pounds of fur every week" % 2.581987435) # Print a float with %f

print("My dog sheds %.2f pounds of fur every week" % 2.581987435) # Print only the first two decimal digits of a float

print("My dog sheds %e hairs every week" % 3175384) # Print a number in scientific notation with %e

print("My cat eats 3 %s and sheds %.2e hairs per day" % ("birds", 9213898)) # Format with more than one value, notice a string is %s

My dog is 5 years old.
My dog sheds 2.581987 pounds of fur every week
My dog sheds 2.58 pounds of fur every week
My dog sheds 3.175384e+06 hairs every week
My cat eats 3 birds and sheds 9.21e+06 hairs per day


A breakdown of an example format code ("%.2f"):

* `%` signals that we're specifying a format code — you always need this
* `.2` says that we want 2 digits after the decimal
* `f` says we want to format the number as a float

There is a lot you can do with these, but the syntax is tricky so it's best to learn by example. Take a look at more examples [here](https://pyformat.info/) if you're curious. Note that the string formatting style we are using is called the "old" string formatting style on this site.
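
Beyond the number of decimal digits, a format code can also set a minimum field width, the alignment, and zero padding. The short sketch below illustrates a few of these options; the variable names and values are chosen only for illustration.

```python
# Illustrative values only
name = "ANMO"
lat = 34.95
elevation = 1820

# %-6s  : left-justify a string in a field 6 characters wide
# %8.3f : right-justify a float in 8 characters, with 3 digits after the decimal
# %06d  : pad an integer with leading zeros to 6 characters
line = "%-6s|%8.3f|%06d" % (name, lat, elevation)
print(line)

# A literal percent sign is written as %%
percent_line = "Humidity: %d%%" % 40
print(percent_line)
```

Fixed-width fields like these are handy when you want columns of printed values to line up.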

## Deliverable 1: String formatting <font color='red'>(50 points)</font>

In a **new code cell** below, write **5 print statements** that use the string formatting method discussed above. There is plenty of room for creativity here!

## Exercise II: File I/O with NumPy

Documentation:
* [`loadtxt()`](https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html)
* [`genfromtxt()`](https://numpy.org/doc/stable/reference/generated/numpy.genfromtxt.html)
* [`savetxt()`](https://numpy.org/doc/stable/reference/generated/numpy.savetxt.html)

### Reading files

For reading in files using NumPy, there are two main options: `loadtxt()` and `genfromtxt()`. Using `genfromtxt()` allows for more flexibility in terms of handling missing data and handling different data types.
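
To make the difference concrete, here is a minimal sketch using a small made-up table with one empty field. The data and variable names are hypothetical, and `StringIO` stands in for a file on disk so that the example is self-contained.

```python
from io import StringIO

import numpy as np

# Hypothetical comma-separated data; the second row has an empty middle field
text = "1.0,2.0,3.0\n4.0,,6.0\n"

# genfromtxt() fills the empty field with nan by default
filled = np.genfromtxt(StringIO(text), delimiter=",")
print(filled)

# loadtxt() cannot convert the empty field to a float and raises a ValueError
try:
    np.loadtxt(StringIO(text), delimiter=",")
    loadtxt_failed = False
except ValueError as err:
    loadtxt_failed = True
    print("loadtxt failed:", err)
```

If your files are complete and purely numeric, `loadtxt()` is usually sufficient; `genfromtxt()` is the safer choice when values may be missing.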

First, let's download a text file called `station.txt` and view its contents:

In [8]:
!curl -O -s http://www.grapenthin.org/teaching/geop501/download/lab07/station.txt  # Download the file
!cat station.txt  # Display the contents

Name Lat Lon Elevation Type Number
ANMO 34.9500 -106.4600 1820 1 2
BAR 34.1500 -106.6280 2121 1 3
BMT 34.2750 -107.2600 1987 1 4
CAR 33.9525 -106.7340 1658 1 5
CBET 32.4200 -103.9900 1042 1 6
CL2B 32.2300 -103.8800 2121 1 7
CL7 32.4400 -103.8100 1032 1 8
CPRX 33.0308 -103.8670 1356 1 9
DAG 32.5913 -104.6910 1277 1 10
GDL2 32.2003 -104.3640 1213 1 11
HTMS 32.4700 -103.6000 1192 1 12
LAZ 34.4020 -107.1390 1878 1 13
LEM 34.1660 -106.9720 1698 1 1
LPM 34.3117 -106.6320 1737 1 14
MLM 34.8100 -107.1450 2088 1 15
SBY 33.9752 -107.1810 3230 1 16
SMC 33.7787 -107.0190 1560 1 17
SRH 32.4914 -104.5150 1276 1 18
SSS 32.3500 -103.4100 1072 1 19
Y22A 33.9370 -106.9650 1674 1 20
Y22D 34.0739 -106.9210 1436 1 21
WTX 34.0722 -106.9460 1555 1 22


The output you see above is what you'd see if you opened `station.txt` in a text editor. This gives you an idea of how the text file is organized. Here are two examples which read this file in using NumPy functions:

In [9]:
# This approach skips the first line of the file, which contains column names, using "skip_header=1"
example_array = np.genfromtxt('station.txt', encoding='utf-8', dtype=None, delimiter=' ', skip_header=1)

# This approach actually uses the header line to name the output variables, using "names=True"
example_array = np.genfromtxt('station.txt', encoding='utf-8', dtype=None, delimiter=' ', names=True)

# Now this gives you a NumPy array of the values of the "Name" column
print(example_array['Name'])

['ANMO' 'BAR' 'BMT' 'CAR' 'CBET' 'CL2B' 'CL7' 'CPRX' 'DAG' 'GDL2' 'HTMS'
 'LAZ' 'LEM' 'LPM' 'MLM' 'SBY' 'SMC' 'SRH' 'SSS' 'Y22A' 'Y22D' 'WTX']


Note that `delimiter=' '` specifies what is separating the columns — in this case, a [blank space](https://www.youtube.com/watch?v=e-ORhEE9VVg&ab_channel=TaylorSwiftVEVO). A file with commas as delimiters instead would have lines like this:
```
ANMO,34.9500,-106.4600,1820,1,2
```
This is commonly known as a "comma-separated value" file or CSV file. Sound familiar?
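
As a sketch of how the call changes for a comma-delimited file, the example below reads two rows of the station table rewritten with commas. The text is held in a `StringIO` object (an assumption made so the example runs without a file on disk); only the `delimiter` argument differs from the earlier call.

```python
from io import StringIO

import numpy as np

# Two station rows rewritten with commas as the delimiter
csv_text = (
    "Name,Lat,Lon,Elevation,Type,Number\n"
    "ANMO,34.9500,-106.4600,1820,1,2\n"
    "BAR,34.1500,-106.6280,2121,1,3\n"
)

# Same arguments as before, except delimiter=','
csv_array = np.genfromtxt(
    StringIO(csv_text), encoding="utf-8", dtype=None, delimiter=",", names=True
)

print(csv_array["Name"])       # Station names as strings
print(csv_array["Elevation"])  # Elevations as integers
```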

### Writing files

To write a file using NumPy, you can use `savetxt()` to save arrays using defined formats. The example below writes `example_array` to a new file called `file_out.txt`. Here, `fmt` defines the format of each row of `example_array` as it is written to the file. This should look very familiar: the format codes are the same printf-style codes we used with `print()` earlier in the lab.

In [10]:
np.savetxt('file_out.txt', example_array, fmt='%s %f %f %i %i %i')

!cat file_out.txt  # View the file contents

ANMO 34.950000 -106.460000 1820 1 2
BAR 34.150000 -106.628000 2121 1 3
BMT 34.275000 -107.260000 1987 1 4
CAR 33.952500 -106.734000 1658 1 5
CBET 32.420000 -103.990000 1042 1 6
CL2B 32.230000 -103.880000 2121 1 7
CL7 32.440000 -103.810000 1032 1 8
CPRX 33.030800 -103.867000 1356 1 9
DAG 32.591300 -104.691000 1277 1 10
GDL2 32.200300 -104.364000 1213 1 11
HTMS 32.470000 -103.600000 1192 1 12
LAZ 34.402000 -107.139000 1878 1 13
LEM 34.166000 -106.972000 1698 1 1
LPM 34.311700 -106.632000 1737 1 14
MLM 34.810000 -107.145000 2088 1 15
SBY 33.975200 -107.181000 3230 1 16
SMC 33.778700 -107.019000 1560 1 17
SRH 32.491400 -104.515000 1276 1 18
SSS 32.350000 -103.410000 1072 1 19
Y22A 33.937000 -106.965000 1674 1 20
Y22D 34.073900 -106.921000 1436 1 21
WTX 34.072200 -106.946000 1555 1 22
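
Notice that the output file above has no header line. One way to keep column names is the `header` argument of `savetxt()`. The sketch below uses a small numeric array and a temporary directory (both chosen for illustration) rather than `example_array`.

```python
import os
import tempfile

import numpy as np

# Illustrative Lat, Lon, Elevation values for two stations
data = np.array([[34.95, -106.46, 1820.0],
                 [34.15, -106.628, 2121.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "coords_out.txt")  # hypothetical file name
    # header adds a first line; comments='' removes the default '# ' prefix
    np.savetxt(out_path, data, fmt="%.4f %.4f %d",
               header="Lat Lon Elevation", comments="")
    with open(out_path) as f:
        lines = f.read().splitlines()

for line in lines:
    print(line)
```

Without `comments=''`, the header line would start with `# `, which NumPy's readers treat as a comment.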


## Exercise III: File I/O with pandas

Documentation:
* [`read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)
* [`read_excel()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html)
* [`to_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html)
* [`to_excel()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html)

### Reading files

We spent time in the last lab working with DataFrames in the pandas library, so it's worth knowing how to bring data from files directly into a DataFrame. This is fairly straightforward and resembles what we did with NumPy in the previous section, but pandas can read Excel files as well as text files (along with many other file formats). Try the following:

In [11]:
# RUN THIS CELL BEFORE ANYTHING ELSE IN THIS SECTION
!pip install openpyxl



In [12]:
import pandas as pd
station_df = pd.read_csv('station.txt', sep=' ', header=0)
station_df

Unnamed: 0,Name,Lat,Lon,Elevation,Type,Number
0,ANMO,34.95,-106.46,1820,1,2
1,BAR,34.15,-106.628,2121,1,3
2,BMT,34.275,-107.26,1987,1,4
3,CAR,33.9525,-106.734,1658,1,5
4,CBET,32.42,-103.99,1042,1,6
5,CL2B,32.23,-103.88,2121,1,7
6,CL7,32.44,-103.81,1032,1,8
7,CPRX,33.0308,-103.867,1356,1,9
8,DAG,32.5913,-104.691,1277,1,10
9,GDL2,32.2003,-104.364,1213,1,11


You now have a DataFrame called `station_df` that contains all the information from the file `station.txt`. You can then work with the DataFrame as we discussed last week, pulling out values in the named columns as needed, using indexing, labels, etc. Note how similar the structure of the DataFrame is to the file structure (compare to the `!cat station.txt` cell above). This is very handy!
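
As a brief reminder of what working with such a DataFrame can look like, here is a minimal sketch. It reads the first three rows of the station table from a `StringIO` object so that it runs on its own; `sample_df` and the 1900 threshold are chosen only for illustration.

```python
from io import StringIO

import pandas as pd

# First three rows of the station table, kept in memory for this example
station_text = (
    "Name Lat Lon Elevation Type Number\n"
    "ANMO 34.9500 -106.4600 1820 1 2\n"
    "BAR 34.1500 -106.6280 2121 1 3\n"
    "BMT 34.2750 -107.2600 1987 1 4\n"
)
sample_df = pd.read_csv(StringIO(station_text), sep=" ", header=0)

# Select a single column by its label
elevations = sample_df["Elevation"]
print(elevations.max())

# Boolean indexing: names of stations whose Elevation exceeds 1900
high_names = sample_df.loc[sample_df["Elevation"] > 1900, "Name"].tolist()
print(high_names)
```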

Importing data from Excel files is similarly easy:

In [13]:
!curl -O -s http://www.grapenthin.org/teaching/geop501/download/lab07/station.xlsx  # Download an Excel file

station_df_excel = pd.read_excel('station.xlsx')
station_df_excel

Unnamed: 0,Name,Lat,Lon,Elevation,Type,Number
0,ANMO,34.95,-106.46,1820,1,2
1,BAR,34.15,-106.628,2121,1,3
2,BMT,34.275,-107.26,1987,1,4
3,CAR,33.9525,-106.734,1658,1,5
4,CBET,32.42,-103.99,1042,1,6
5,CL2B,32.23,-103.88,2121,1,7
6,CL7,32.44,-103.81,1032,1,8
7,CPRX,33.0308,-103.867,1356,1,9
8,DAG,32.5913,-104.691,1277,1,10
9,GDL2,32.2003,-104.364,1213,1,11


### Writing files

pandas DataFrames have `to_csv()` and `to_excel()` methods built in. To write `station_df` to a CSV file, for example, all we have to do is:

In [14]:
station_df.to_csv('station.csv', index=False)
!cat station.csv  # View the resulting file

Name,Lat,Lon,Elevation,Type,Number
ANMO,34.95,-106.46,1820,1,2
BAR,34.15,-106.628,2121,1,3
BMT,34.275,-107.26,1987,1,4
CAR,33.9525,-106.734,1658,1,5
CBET,32.42,-103.99,1042,1,6
CL2B,32.23,-103.88,2121,1,7
CL7,32.44,-103.81,1032,1,8
CPRX,33.0308,-103.867,1356,1,9
DAG,32.5913,-104.691,1277,1,10
GDL2,32.2003,-104.364,1213,1,11
HTMS,32.47,-103.6,1192,1,12
LAZ,34.402,-107.139,1878,1,13
LEM,34.166,-106.972,1698,1,1
LPM,34.3117,-106.632,1737,1,14
MLM,34.81,-107.145,2088,1,15
SBY,33.9752,-107.181,3230,1,16
SMC,33.7787,-107.019,1560,1,17
SRH,32.4914,-104.515,1276,1,18
SSS,32.35,-103.41,1072,1,19
Y22A,33.937,-106.965,1674,1,20
Y22D,34.0739,-106.921,1436,1,21
WTX,34.0722,-106.946,1555,1,22


Note that we used `index=False` above to avoid writing the DataFrame's index (the row labels shown at the far left when a DataFrame is displayed), which `to_csv()` writes as the first column by default.
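
To see the effect of `index=False` directly, the sketch below writes a tiny, made-up DataFrame both ways. Calling `to_csv()` without a file name returns the CSV text as a string, which makes the comparison easy.

```python
import pandas as pd

# A small DataFrame built by hand for illustration
small_df = pd.DataFrame({"Name": ["ANMO", "BAR"], "Elevation": [1820, 2121]})

# Default behavior: the index is written as an unnamed first column
with_index = small_df.to_csv().splitlines()
print(with_index)

# index=False: only the named columns are written
without_index = small_df.to_csv(index=False).splitlines()
print(without_index)
```

If a file written with the index is later read back with `read_csv()`, that extra column shows up as `Unnamed: 0`.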

## Deliverable 2: Practice with CSV file I/O <font color='red'>(50 points)</font>

For this deliverable, you'll need to get the file `Bogoslof_SO2_per_event.csv`. To do this, execute the following code cell:



In [15]:
!curl -O -s https://raw.githubusercontent.com/uafgeoteach/GEOS636_PAG/master/labs/Bogoslof_SO2_per_event.csv

### About the file

This CSV file contains information about the volcanic gases emitted by Bogoslof volcano during a series of more than 30 explosive eruptive events occurring in 2016–2017. Bogoslof Island is located in the southern Bering Sea (north of the Aleutian volcanic arc). From the Alaska Volcano Observatory (AVO) website:

> Bogoslof Island is the largest of a cluster of small, low-lying islands comprising the emergent summit of a large submarine stratovolcano.

The island itself is highly dynamic due to the eruptive and erosional processes constantly shaping it. Here's a view from August 2017, courtesy of Dave Withrow (NOAA/Fisheries).

![Oblique view of Bogoslof Island](https://www.avo.alaska.edu/images/dbimages/display/1503799877.jpg)

For each eruptive event, AVO calculated the mass of sulfur dioxide (SO<sub>2</sub>) emitted using satellite measurements. The provided file has columns of event number, eruption onset time, mass of SO<sub>2</sub> emitted (in kt), time of the satellite SO<sub>2</sub> measurement, and volcanic plume height. Note that 1 kt (kiloton) = 1000 metric tons, and 1 metric ton = 1000 kg.

### Your task

In a **new code cell below**, read this CSV into Python using pandas, modify the SO<sub>2</sub> mass column so that the units are kg (and rename the column header to match!), and write out a new CSV file that reflects this modification.

**Notes:**

* Demonstrate that your output file reflects the requested modifications by typing `!head <filename>` where `<filename>` is the name of your output CSV file. This will show the first few lines of the CSV file.
* To rename a column of a pandas DataFrame, you may use the syntax: `df = df.rename(columns=<rename_dict>)` where `df` is your DataFrame and `<rename_dict>` is a dictionary with keys (strings) specifying the current column names and values (strings) specifying the desired names.
* Remember to use `index=False` when writing the new CSV to avoid adding the index column!
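
The rename syntax from the notes above can be sketched on an unrelated toy DataFrame. The column names `site` and `distance_km` are hypothetical and are not the column names in the Bogoslof file.

```python
import pandas as pd

# Toy data with a hypothetical column measured in kilometers
toy_df = pd.DataFrame({"site": ["A", "B"], "distance_km": [1.5, 0.25]})

# Convert the values from km to m
toy_df["distance_km"] = toy_df["distance_km"] * 1000

# Rename the column so that the header matches the new units
toy_df = toy_df.rename(columns={"distance_km": "distance_m"})
print(toy_df)
```

Note that `rename()` returns a new DataFrame, which is why the result is assigned back to `toy_df`.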

## Deliverable 3: I/O with your own file <font color='red'>Only Required for 636 Students (50 points)</font>

For this last deliverable, we want you to practice on a file relevant to your own research. Find a text/CSV/Excel file associated with your research, and do the following:

1. Upload the file to OpenSARlab (see instructions on the Canvas page)
2. Read the file in using NumPy or pandas
3. Modify the file somehow, like we did in D2: change the units, add or remove columns, add a row, etc., using NumPy and/or pandas tools
4. Write out the modified contents to a new file **of the same type as the input file** — that means Excel in, Excel out, for example!
5. To submit this lab, make a ZIP of the input file and this notebook, then upload the ZIP to Canvas.


Steps 2–4 should take place in a **new code cell** below.

---

🚨 **636 students - For this lab, you MUST submit a ZIP file containing two items: This notebook and your input file for D3!** 🚨

**Note:** If you can't find any candidate file related to your research, find an interesting text/CSV/Excel file on the internet, download it, and use it here. But we prefer for you to use something relevant to your research if at all possible!