# Introduction to Python - Lecture 07 (24 Oct 2018)

### Agenda for today:
+ m/z calculation example from lecture 05
+ similarity matching example from lecture 06
+ Working with files and filesystem:
    - basics of file handling in python
    - Example - weblog analytics
+ log aggregation example

# Data Persistence

+ Files
    + **<font color='blue'>\*.txt</font>**, \*.xml, \*.json
    + \*.csv, \*.tab, \*.xlsx (covered later, with pandas)
+ Databases (covered later in the course), for when you want to capture relationships between data / entities

# Built-in *<font color='blue'>file</font>* object

+ Basic format:
```python
fh = open('<filename>', '<mode>')   # Creates a file object fh
```
+ *filename* can be _**absolute**_ or _**relative**_
+ *mode*: {'r', 'w', 'a'}; Default='r'
+ 'r': open file for reading if exists, else **<font color='blue'>FileNotFoundError</font>**
+ 'w': open new file for writing; overwrite if exists; use 'a' to avoid overwriting

```python
fh = open('data/data.txt')
type(fh)
dir(fh)
```

+ If the file does not exist, `open` raises a **FileNotFoundError** and prints a traceback
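
To make the three modes concrete, here is a minimal sketch that runs in a throwaway temporary directory. The file name `demo.txt` and the variable names are made up for illustration; they are not part of the lecture's data folder.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()                 # scratch directory (assumption: any writable dir works)
path = os.path.join(tmp_dir, 'demo.txt')     # hypothetical file name

# 'r' on a file that does not exist raises FileNotFoundError
try:
    fh = open(path, 'r')
except FileNotFoundError as err:
    missing = True
    print('Could not open:', err.filename)
else:
    missing = False
    fh.close()

# 'w' creates the file (or empties an existing one)
fh = open(path, 'w')
fh.write('first\n')
fh.close()

# 'a' keeps the existing content and adds to the end
fh = open(path, 'a')
fh.write('second\n')
fh.close()

fh = open(path, 'r')
content_after_append = fh.read()
fh.close()

# Opening with 'w' again discards everything written so far
fh = open(path, 'w')
fh.close()
size_after_rewrite = os.path.getsize(path)
print(repr(content_after_append), size_after_rewrite)
```

Catching `FileNotFoundError` is one way to handle a missing file; checking with `os.path.exists` first is another, at the cost of a small window in which the file could disappear between the check and the `open`.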

**Notes**:
+ *fh* is not the file itself, but a handle/reference to it. Use it to do desired operations (read/write).
<br />  
![alt text](filehandle.svg)
<br />
+ <font color='blue'>Some additional mode options: 'rb', 'wb' for reading and writing binary files; '+' to open the file for both reading and writing.</font>
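
The extra modes can be tried out with a short sketch like the one below. The file names `blob.bin` and `notes.txt` are hypothetical and live in a temporary directory, so nothing in the course folder is touched.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()                     # throwaway directory for the demo
bin_path = os.path.join(tmp_dir, 'blob.bin')     # hypothetical file names
txt_path = os.path.join(tmp_dir, 'notes.txt')

# 'wb': write raw bytes; no text encoding or newline handling is applied
fh = open(bin_path, 'wb')
fh.write(bytes([0, 1, 2, 255]))
fh.close()

# 'rb': the data comes back as a bytes object, not a str
fh = open(bin_path, 'rb')
raw = fh.read()
fh.close()
print(type(raw), list(raw))

# 'r+': read and write through the same handle (the file must already exist)
fh = open(txt_path, 'w')
fh.write('abc\n')
fh.close()

fh = open(txt_path, 'r+')
first = fh.read()          # pointer is now at the end of the file
fh.write('def\n')          # so this write lands after the existing text
fh.seek(0)                 # go back to the start to read everything
combined = fh.read()
fh.close()
print(repr(first), repr(combined))
```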

# Reading from Files in "text mode"
+ File content is always read in as strings  
<br />
+ Here are the **most common approaches**:

    - **Read all data at once as a string**

    ```python
    import pprint
    fh = open('data/data.txt', 'r')
    data = fh.read()              # to read in all data as one big string
    print(type(data), '\n\n')
    print(data)
    print('\n', data.split('\n'))   # split the big string on the newline character (\n);
                                    # in text mode, Windows '\r\n' line endings are read as '\n'
    ```

+ **file pointers and *<font color='blue'>seek</font>* operation**

```python
data_read_again = fh.read()        # can't read more without resetting the read pointer
print("length of data_read_again is: ", len(data_read_again))
fh.seek(0, 0)          # reset the pointer to beginning of the file: fh.seek(offset, from_what)
                       # https://docs.python.org/3/tutorial/inputoutput.html
data_now = fh.read()
print(data_now)
```

+ **Read individual lines as strings**
```python
fh.seek(0, 0)
data = fh.readlines()       # returns list of strings
print(type(data), "\n\n")   # note the \n newline character at the end of each line
print(data)                 # (Windows '\r\n' line endings are read as '\n' in text mode)
```

+ **Iterate over large files**
```python
fh.seek(0, 0)
for line in fh:           # 'fh' is iterable; use in iteration context for efficient
    print(len(line), line)  # reading of large files
fh.close()                # close file; good practice (esp. when writing files)
```
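
Each line produced by the loop still carries its trailing newline, so it is usually stripped before the text is used. The sketch below gathers a few statistics while reading line by line; the sample file `lines.txt` and its content are invented for the demo.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'lines.txt')    # hypothetical sample file
fh = open(path, 'w')
fh.write('alpha\nbeta\n\ngamma\n')
fh.close()

n_lines = 0
n_blank = 0
longest = ''

fh = open(path)
for line in fh:                  # lines are read one at a time, not all at once
    text = line.rstrip('\n')     # drop the trailing newline before using the text
    n_lines += 1
    if text == '':
        n_blank += 1
    if len(text) > len(longest):
        longest = text
fh.close()

print(n_lines, n_blank, longest)
```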

+ **Context manager**
```python
with open('data/data.txt', 'r') as fh:  # Context-manager; automatically closes the file
    for line in fh:
        print(line)
print('\nFile closed? : ', fh.closed)
```
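
The main benefit of the context manager shows up when something goes wrong inside the block: the file is still closed. A minimal sketch, assuming a made-up file `numbers.txt` whose last line is deliberately not a number:

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'numbers.txt')   # hypothetical sample file
with open(path, 'w') as fh:
    fh.write('1\n2\nthree\n')

total = 0
try:
    with open(path) as fh:
        for line in fh:
            total += int(line)     # int() fails with ValueError on 'three'
except ValueError as err:
    print('Bad line:', err)

print('total so far:', total)
print('File closed? :', fh.closed)   # closed even though the loop was interrupted
```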

# Writing to Files in "text mode"
+ Like reading, writing is also done as strings
<br />
+ Here are the **most common approaches**:

```python
fh = open('data/fresh.txt', 'w')           # Open a new file in write mode
fh.write('This is the 1st line\n')    # Write a line; 
                                      # Note that newline chars must be explicitly added
fh.close()
```



+ **Flushing buffers**

```python
fh = open('data/fresh.txt', 'a')           # Open the earlier file in 'append' mode
                                      #    to avoid overwriting
fh.write('This is the 2nd line\n')    # Write a line; 
                                      # Note that newline chars must be explicitly added
fh.flush()                            # Pushes buffered data out to the file now; the file stays open
```

+ **Write multiple lines at once**

```python
fh.writelines(['This is the 3rd line\n', 'This is the 4th line\n'])   # Note the newline
fh.flush()
```

+ **Write iteratively**

```python
more_lines = ['5th line', '6th line', '7th line']

for line in more_lines:       # iteration context
    fh.write(line + '\n')
    fh.flush()                # Shown only for illustration: flushing after every write is inefficient; close() flushes anyway
fh.close()
```
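
Because `write` accepts only strings, other values have to be formatted first. The sketch below combines that with the context manager from the reading section; the file name `report.txt` and the sample rows are invented for illustration.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'report.txt')    # hypothetical output file

rows = [('apple', 3), ('pear', 12)]            # made-up sample data

with open(path, 'w') as fh:
    for name, count in rows:
        # numbers must be turned into text before they can be written
        fh.write('{}\t{}\n'.format(name, count))
    # print() can also target a file, and it adds the newline itself
    print('total', sum(count for _, count in rows), sep='\t', file=fh)

with open(path) as fh:
    written = fh.read()
print(written)
```

No explicit `flush` or `close` is needed here: leaving the `with` block does both.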



### There are specialized modules to work with specific file formats:
1. csv (comma-separated values)
2. xlrd (excel documents): **pip install** xlrd OR **conda install** xlrd
3. json (hierarchical text-based format, similar to python dictionaries)
4. yaml (another hierarchical text-based format like json): **pip install** pyyaml OR **conda install** pyyaml
5. pandas (works with csv, tsv, xls and many more formats): **pip install pandas** OR **conda install pandas**
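
As a small taste of the two standard-library modules in the list, here is a hedged sketch that writes the same records with `csv` and `json` and reads them back. The records and file names are made up; only calls from the standard library are used.

```python
import csv
import json
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, 'people.csv')     # hypothetical file names
json_path = os.path.join(tmp_dir, 'people.json')

records = [{'name': 'Ada', 'age': 36}, {'name': 'Linus', 'age': 28}]  # made-up data

# csv: newline='' lets the csv module control line endings itself
with open(csv_path, 'w', newline='') as fh:
    writer = csv.DictWriter(fh, fieldnames=['name', 'age'])
    writer.writeheader()
    writer.writerows(records)

with open(csv_path, newline='') as fh:
    csv_rows = list(csv.DictReader(fh))    # every value comes back as a string

# json: numbers, lists and nested dictionaries keep their types
with open(json_path, 'w') as fh:
    json.dump(records, fh, indent=2)

with open(json_path) as fh:
    json_rows = json.load(fh)

print(csv_rows[0]['age'], type(csv_rows[0]['age']))
print(json_rows[0]['age'], type(json_rows[0]['age']))
```

The difference in the printed types is the practical point: csv is flat text, so conversions back to numbers are up to the reader, while json preserves basic types.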

## os.path module
```python
import os
dir(os.path)  # Note that path is another module that os module imports. 
              # When we import os, path becomes available as a module variable within os namespace.
              # --> module / namespace hierarchy
```
+ **path parsing:**
    - os.path.split(<path_str>): splits a path into (directory, file name)
    - os.path.splitext(<path_str>): splits a path into (root, extension)
```python
# Ex.
print(os.path.split('/Users/groveh01/Documents/my_data.txt'))
print(os.path.splitext('my_data.txt'))
```
+ **path building:**
    - os.path.join(<path_components>)
```python
# Ex.
print(os.path.join('/Users', 'groveh01', 'Documents', 'Teaching'))
```
+ **common tests:**
    - os.path.<test>, where test = {isdir(), isfile(), exists(), ...}
```python
# Ex.
print(os.path.isdir('data/data.txt'))
print(os.path.isfile('data/data.txt'))
print(os.path.exists('data'))
print(os.path.exists('data.txt'))
print(os.path.exists('/Users/groveh01'))
```
+ **listing contents of a dir:**
    - os.listdir
```python
# Ex.
import pprint
pprint.pprint(os.listdir('/Users/groveh01'))
pprint.pprint(os.listdir('.'))
pprint.pprint(os.listdir('..'))
```
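
The pieces of `os.path` are typically used together. The sketch below lists only the regular `.txt` files in a directory; the directory is a temporary one filled with invented names, including a sub-directory called `sub.txt` to show why the `isfile` test matters.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
for name in ['a.txt', 'b.csv', 'c.txt']:          # made-up file names
    open(os.path.join(tmp_dir, name), 'w').close()
os.mkdir(os.path.join(tmp_dir, 'sub.txt'))        # a directory whose name ends in .txt

txt_files = []
for name in sorted(os.listdir(tmp_dir)):          # listdir returns bare names, in arbitrary order
    full_path = os.path.join(tmp_dir, name)       # join with the directory before testing
    ext = os.path.splitext(name)[1]
    if os.path.isfile(full_path) and ext == '.txt':
        txt_files.append(name)

print(txt_files)
```

Note that `os.listdir` returns names relative to the listed directory, so tests such as `os.path.isfile` need the joined path unless the directory is the current one.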