# 1. Reading data from the desktop file system

The goal of this notebook session is to read a "comma separated values" (CSV) data file from the computer desktop and load it into a Python data structure suitable for further analysis. In this section, we will explore Python tools for reading a text file line by line into an array-like `list`, parsing and cleaning the contents of each line, and then converting the data elements of each line from strings to their appropriate data types.

The data we will use in this session are from an inactive research site of the *Long Term Ecological Research Network*, called *North Inlet LTER*. The data consist of daily water samples from 1978 to 1992. They are available from the *Environmental Data Initiative* (EDI) [data repository](https://portal.edirepository.org/nis) under the repository identifier [knb-lter-nin.1.1](https://portal.edirepository.org/nis/mapbrowse?scope=knb-lter-nin&identifier=1).

#### References

 1. Python `with`
    1. [Python 3 reference](https://docs.python.org/3/reference/compound_stmts.html#with) 
    1. A `with` [Anti-pattern](https://docs.quantifiedcode.com/python-anti-patterns/maintainability/not_using_with_to_open_files.html) or when/how `with` should be used.
 1. Python `list`
     1. [Python 3 reference](https://docs.python.org/3.6/library/stdtypes.html#list)
     1. [Tutorial](https://docs.python.org/3/tutorial/datastructures.html#more-on-lists)
     1. `list` [comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions)
 1. Python `array`
     1. [Python 3 reference](https://docs.python.org/3/library/array.html)
     1. *arrays* **are not** *lists* [stackoverflow](https://stackoverflow.com/questions/176011/python-list-vs-array-when-to-use?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa)
 1. Python `dictionary`
     1. [Python 3 reference](https://docs.python.org/3/tutorial/datastructures.html#dictionaries)
     1. [Tutorial](https://docs.python.org/3/tutorial/datastructures.html#dictionaries)
 1. [Matplotlib](https://matplotlib.org/index.html)

### Download and verify our data file `LTER.NIN.DWS.csv` using BASH shell commands.

In [None]:
!curl -s -X GET https://pasta.lternet.edu/package/data/eml/knb-lter-nin/1/1/DailyWaterSample-NIN-LTER-1978-1992 > LTER.NIN.DWS.csv

In [None]:
!head -n 10 ./LTER.NIN.DWS.csv

In [None]:
!tail -n 10 ./LTER.NIN.DWS.csv

In [None]:
!wc -l ./LTER.NIN.DWS.csv

### Read the data table file `LTER.NIN.DWS.csv` and load into a multi-dimensional Python `list` data structure.

The data file we are using consists of text formatted as a "comma separated values" table, with a mixture of column data types, including dates, text, floating point, and integer values. As with most software, Python reads text files one line at a time, with lines delimited by the line feed character `\n`. This line feed is considered white space and should be removed from the end of each line.
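
As a minimal sketch of this cleaning step, the snippet below strips the trailing line feed from a single line and splits it on commas. The sample line is hypothetical and invented for illustration; it is not taken from the data file.

```python
# Hypothetical sample line, invented for illustration only
raw_line = "01/02/1980,A,12.5\n"

no_newline = raw_line.rstrip("\n")  # remove only the trailing line feed
fields = no_newline.split(",")      # split on the comma delimiter

print(repr(raw_line))  # repr() makes the trailing \n visible
print(fields)
```

Calling `strip()` with no arguments, as the notebook does, removes all leading and trailing white space rather than just the line feed.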

Because the file is text, each full line is read and saved as a Python string. In Python 3, strings (or `str`) are composed of Unicode characters. A discussion of Unicode is beyond the scope of this session, but there are plenty of sites on the Internet that provide good information on the subject ([Python 3 reference](https://docs.python.org/3.6/howto/unicode.html?highlight=unicode), [How Python does Unicode](https://www.b-list.org/weblog/2017/sep/05/how-python-does-unicode/), [Pragmatic Unicode, or How do I stop the pain?](https://www.youtube.com/watch?v=sgHbC6udIqc), and [Characters, Symbols and the Unicode Miracle](https://youtu.be/MijmeoH9LT4)).

The Python `with` statement creates a context in which scope-bounded execution can occur. Using the `with` statement for file operations is recommended because the file handle will be closed automatically, even if an exception occurs during the read operation.
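
The following sketch compares the `with` form with a roughly equivalent `try`/`finally` form. To stay self-contained it writes a small throwaway file first; the file name and contents are hypothetical.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.txt")  # hypothetical file name
with open(path, "w") as f:
    f.write("first line\nsecond line\n")

# Manual form: the file must be closed explicitly, even if reading fails
f = open(path, "r")
try:
    manual_lines = f.readlines()
finally:
    f.close()

# `with` form: the file is closed automatically when the block exits
with open(path, "r") as f:
    with_lines = f.readlines()

print(f.closed)                    # the handle is closed after the block
print(manual_lines == with_lines)  # both forms read the same lines
```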

The expression in the square brackets `[]` in the cell below is a Python `list comprehension`, which is more compact than an equivalent `for` loop and is generally considered to be more efficient.
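
To make the comparison concrete, here is a small sketch that builds the same nested `list` twice, once with an explicit `for` loop and once with a list comprehension. The `lines` values are hypothetical stand-ins for lines read from a CSV file.

```python
# Hypothetical lines standing in for the contents of a CSV file
lines = ["Date,transect,water_temp\n", "01/02/1980,A,12.5\n"]

# Explicit loop: append one parsed row at a time
table_loop = []
for line in lines:
    table_loop.append(line.strip().split(","))

# Equivalent list comprehension: same result in a single expression
table_comp = [line.strip().split(",") for line in lines]

print(table_loop == table_comp)
```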

Together, the `with` statement and the list comprehension in the cell below generate a Python `list` data structure, but not quite the one we need for data analysis.

In [None]:
with open('./LTER.NIN.DWS.csv', 'r') as f:  # Open text file for reading
    table = [line.strip().split(',') for line in f]  # <-- list comprehension

In [None]:
type(table)

In [None]:
len(table)

In [None]:
for head in table[:9]:
    print(head)

In [None]:
for tail in table[-10:]:
    print(tail)

In [None]:
header = table[0] # Set header from original table
print(header)

In [None]:
table = table[1:] # Remove header; extract only string values that represent the real table data
len(table)

### Convert the `str` values into the appropriate data types.

For us to use the data contained within this file, each line must be parsed into its respective data tokens, and each token must be converted into its real data type (e.g., date, float, integer, ...). There are some good Python packages that can guess at the conversion, but we will perform this task manually since we too can make a good guess as to the data types.
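
One way to sketch the conversion is a list of converter functions, one per column, applied positionally to the string tokens of a row. The three-column row and the `converters` name are hypothetical and chosen only for illustration; the real table has more columns, and the notebook converts them one statement at a time.

```python
from datetime import datetime

# Hypothetical three-column row of string tokens (not from the data file)
row = ["01/02/1980", "A", "12.5"]

# One converter per column, in column order (an illustrative choice)
converters = [
    lambda s: datetime.strptime(s, "%m/%d/%Y"),  # date string -> datetime
    str,                                         # text stays a str
    float,                                       # numeric string -> float
]

# Pair each converter with its token and apply it
typed_row = [convert(token) for convert, token in zip(converters, row)]

print(typed_row)
print([type(value).__name__ for value in typed_row])
```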

The result of this process will be a multi-dimensional Python `list` data structure that holds the data - you can think of this as a 2-dimensional array. When creating this `list` data structure, we can order the table in two ways: 1) row major or 2) column major.

#### Row major order

*Row major order* means that the data are ordered in the same way that they were read from the data file: each line becomes a row-based `list` that stores a single data point from each data column.

```
[pnt1-col1, pnt1-col2, pnt1-col3, pnt1-col4,...]
[pnt2-col1, pnt2-col2, pnt2-col3, pnt2-col4,...]
.
.
.
[pntN-col1, pntN-col2, pntN-col3, pntN-col4,...]
```

This may seem more natural when processing the data table for printing or examination, but it does not capture the columnar model of the actual table and results in more work when performing downstream analysis of a single data column because you must iterate through each row and select the data point of interest.
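
The extra work can be seen in a small sketch: to obtain one column from a row-major table, every row has to be visited. The `rows` values below are hypothetical.

```python
# Hypothetical row-major table: each inner list is one row
rows = [
    ["1980-01-01", "A", 10.5],
    ["1980-01-02", "A", 11.0],
    ["1980-01-03", "B", 9.8],
]

# Selecting the third column requires iterating over every row
water_temp = [row[2] for row in rows]

print(water_temp)
```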

In [None]:
# Populate data frame with coerced (converted) values from data table in row-major order

from datetime import datetime

df = []
for row in table:
    data = []
    date = datetime.strptime(row[0], '%m/%d/%Y')
    data.append(date)            # Date as datetime
    data.append(row[1])          # transect as unicode string
    data.append(float(row[2]))   # water_temp as float
    data.append(float(row[3]))   # SAL as float
    data.append(float(row[4]))   # TNW as float
    data.append(float(row[5]))   # TNF as float
    data.append(float(row[6]))   # TPW as float
    data.append(float(row[7]))   # TPF as float
    data.append(float(row[8]))   # POP as float
    data.append(float(row[9]))   # NHN as float
    data.append(float(row[10]))  # NNN as float
    data.append(int(row[11]))    # CHEM as integer
    data.append(float(row[12]))  # TOC as float
    data.append(float(row[13]))  # DOC as float
    data.append(float(row[14]))  # POC as float
    df.append(data)

In [None]:
# Access the "Date" and "water_temp" columns and plot the data

import matplotlib
import matplotlib.pyplot as plt

date = []
water_temp = []
for row in df:
    date.append(row[0])
    water_temp.append(row[2])
    
fig, ax = plt.subplots()
ax.plot(date, water_temp, label='Water Temp')
ax.grid(True)
fig.autofmt_xdate()
fig.set_size_inches(10, 8)
plt.legend()
plt.show()

#### Column major order

*Column major order* means that the data are grouped by column, so that each column can be accessed as a single `list`.

```
[pnt1-col1, pnt2-col1, pnt3-col1, ..., pntN-col1]
[pnt1-col2, pnt2-col2, pnt3-col2, ..., pntN-col2]
.
.
.
[pnt1-colM, pnt2-colM, pnt3-colM, ..., pntN-colM]
```

Building the column major order `list` data structure requires minimal effort and results in a much more functional data structure for performing column-based data analysis.
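
As an aside, when a row-major table already exists, it can be transposed into column-major order with `zip`. This is a sketch on hypothetical data and is not the approach used in the notebook, which fills the columns directly while converting each row.

```python
# Hypothetical row-major table: each inner list is one row
rows = [
    ["1980-01-01", "A", 10.5],
    ["1980-01-02", "A", 11.0],
    ["1980-01-03", "B", 9.8],
]

# zip(*rows) groups the i-th element of every row together
columns = [list(column) for column in zip(*rows)]

# A whole column is now available with a single index
print(columns[2])
```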

In [None]:
# Populate data frame with coerced (converted) values from data table in column-major order

from datetime import datetime

df = [[], [], [], [], [], [], [], [], [], [], [], [], [], [], []]
for row in table:
    date = datetime.strptime(row[0], '%m/%d/%Y')
    df[0].append(date)             # Date as datetime
    df[1].append(row[1])           # transect as unicode string
    df[2].append(float(row[2]))    # water_temp as float
    df[3].append(float(row[3]))    # SAL as float
    df[4].append(float(row[4]))    # TNW as float
    df[5].append(float(row[5]))    # TNF as float
    df[6].append(float(row[6]))    # TPW as float
    df[7].append(float(row[7]))    # TPF as float
    df[8].append(float(row[8]))    # POP as float
    df[9].append(float(row[9]))    # NHN as float
    df[10].append(float(row[10]))  # NNN as float
    df[11].append(int(row[11]))    # CHEM as integer
    df[12].append(float(row[12]))  # TOC as float
    df[13].append(float(row[13]))  # DOC as float
    df[14].append(float(row[14]))  # POC as float


In [None]:
# Access the "Date" and "water_temp" columns and plot the data

import matplotlib
import matplotlib.pyplot as plt

date = df[0]
water_temp = df[2]
    
fig, ax = plt.subplots()
ax.plot(date, water_temp, label='Water Temp')
ax.grid(True)
fig.autofmt_xdate()
fig.set_size_inches(10, 8)
plt.legend()
plt.show()

### Creating a more efficient (storage and speed) data structure using the Python `array` to store data.

Python `array` data structures are closely aligned with the native C data types used in the implementation of the Python language. An `array` stores its elements as raw, fixed-size values of a single type (for example, C doubles) rather than as references to individual Python objects, as a `list` does. As a result, arrays are considerably more compact in storage, and they can be processed faster by code that works directly on the underlying values.
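
The storage difference can be estimated with a short sketch. The sizes reported by `sys.getsizeof` are specific to the Python implementation and platform, so treat the numbers as indicative only; the sample values are hypothetical.

```python
import array
import sys

# Hypothetical sample values: 1000 floats
values = [float(i) for i in range(1000)]
as_array = array.array("d", values)  # 'd' = C double

# A list holds references, so count the float objects as well
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
array_bytes = sys.getsizeof(as_array)

print(as_array.itemsize)        # bytes used per array element
print(list_bytes, array_bytes)  # approximate memory use of each structure

# An array enforces its element type
try:
    as_array.append("not a number")
except TypeError as err:
    type_error = err
    print("TypeError:", err)
```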

The following code demonstrates how to implement a *column major order* data table similar to the Python `list` data structure above. It also uses the Python `dict` data structure, which provides a "key-value" associative capability so that we may refer to our columns using their respective names.
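
A minimal sketch of the key-value idea, using hypothetical column names and values: `zip` pairs each name with its column, and `dict` turns the pairs into a name-to-column mapping. The name `frame` is illustrative only.

```python
# Hypothetical column names and column-major data
names = ["Date", "transect", "water_temp"]
columns = [
    ["01/02/1980", "01/03/1980"],
    ["A", "B"],
    [12.5, 13.0],
]

# Pair each name with its column to build the mapping
frame = dict(zip(names, columns))

# Columns are now retrieved by name instead of by position
print(frame["water_temp"])
```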

In [None]:
# Define an empty data frame data structure as Python dictionary

import array
df = {
    'Date': [],                      # datetime object
    'transect': [],                  # unicode string
    'water_temp': array.array('d'),  # double float
    'SAL': array.array('d'),         # double float
    'TNW': array.array('d'),         # double float
    'TNF': array.array('d'),         # double float
    'TPW': array.array('d'),         # double float
    'TPF': array.array('d'),         # double float
    'POP': array.array('d'),         # double float
    'NHN': array.array('d'),         # double float
    'NNN': array.array('d'),         # double float
    'CHEM': array.array('l'),        # signed long
    'TOC': array.array('d'),         # double float
    'DOC': array.array('d'),         # double float
    'POC': array.array('d')          # double float
}

In [None]:
# Populate data frame with coerced (converted) values from data table in column-major order

from datetime import datetime

for row in table:
    date = datetime.strptime(row[0], '%m/%d/%Y')
    df['Date'].append(date)                 # Date as datetime
    df['transect'].append(row[1])           # transect as unicode string
    df['water_temp'].append(float(row[2]))  # water_temp as float object
    df['SAL'].append(float(row[3]))         # SAL as float object
    df['TNW'].append(float(row[4]))         # TNW as float object
    df['TNF'].append(float(row[5]))         # TNF as float object
    df['TPW'].append(float(row[6]))         # TPW as float object
    df['TPF'].append(float(row[7]))         # TPF as float object
    df['POP'].append(float(row[8]))         # POP as float object
    df['NHN'].append(float(row[9]))         # NHN as float object
    df['NNN'].append(float(row[10]))        # NNN as float object
    df['CHEM'].append(int(row[11]))         # CHEM as integer object
    df['TOC'].append(float(row[12]))        # TOC as float object
    df['DOC'].append(float(row[13]))        # DOC as float object
    df['POC'].append(float(row[14]))        # POC as float object

In [None]:
len(df['CHEM'])

In [None]:
# Access the "Date" and "water_temp" columns and plot the data

import matplotlib
import matplotlib.pyplot as plt

date = df['Date']
water_temp = df['water_temp']
    
fig, ax = plt.subplots()
ax.plot(date, water_temp, label='Water Temp')
ax.grid(True)
fig.autofmt_xdate()
fig.set_size_inches(10, 8)
plt.legend()
plt.show()