# Files

_(c) 2022, Mark van den Brand and Lina Ochoa Venegas, Eindhoven University of Technology_

## Table of Contents
- [1. Introduction](#introduction)
- [2. Persistence](#persistence)
- [3. Opening Files](#opening-files)
- [4. Reading Files](#reading-files)
- [5. Searching through a File](#searching-file)
- [6. File Names and Paths](#names-paths)
- [7. Let the User Choose the File](#choose-file)
- [8. Writing Files](#writing-files)
- [9. The `with` Statement](#with)
- [10. File Formats](#file-formats)

## 1. Introduction <a class="anchor" id="introduction"></a>

This Jupyter Notebook discusses how we can store data in a *persistent* way. Storing data in a *persistent* way means that we can retrieve the
data at a later moment in time for inspection or further computations. 
Programs that do not manipulate data do not make much sense, so developing serious software always involves data manipulation or inspection.
The storage of data is as important as the manipulation of data and this can be done via files and databases.
It is not difficult to see that data storage is crucial for banks, insurance companies, hospitals, companies providing social media services, etc.

We have already seen how we can open a file and read the content of a file, but storing data to process it later is indispensable. This functionality enables us to store data at some point in time and retrieve it at a later point in time.

## 2. Persistence <a class="anchor" id="persistence"></a>

The programs we have seen, and that you have written, process data (input values) and print the results.
The produced data then evaporates; in real life, however, this is not what we want.
Banks, webshops, etc. do not want to lose their data about customers, their shopping history, etc. From a data science point of view this data is extremely important, because it can be used to predict trends.

Programs that store their data are **persistent**. 

Some programs start by reading in (stored) data and continue working with this data. Before terminating, they store the new data again.
Other programs run "forever" and store their data in a persistent way (on disk) in order to ensure that no data is lost.

One of the simplest ways for programs to maintain their data is by reading and writing (text) **files**. 
Another way is to store data in a **database**.

## 3. Opening Files <a class="anchor" id="opening-files"></a>

When we open a file, what we are really doing is asking the Operating System (OS) to find a file by name and verify that it exists.
We use the built-in function `open` in Python to achieve this task.

In [None]:
file_out = open('datasets/output.txt')

If the function `open` is successful, the OS returns a **file handler**.
The file handler is not the real data, but instead, it is an intermediary that we can use to read from or write to the file.

A text file can be seen as a **sequence of lines**.

Python considers a special character to break the text into lines. This special character is known as the **newline** character, which represents the end of the line.

The newline character is represented as `\n`. If you include this character in a string, the content after the newline character will be displayed in a new line.

In [None]:
print('Data\nScience')

<div class="alert alert-success">
    <b>Do It Yourself!</b><br>
    Can you print the text: <br><br>
    "I am <br>
    a Data Science student" <br><br>
    while using the newline character?
</div>

In [None]:
# Remove this line and add your code here

## 4. Reading Files <a class="anchor" id="reading-files"></a>

To read a file we rely on the `open` function.
This function does not read the **entire content** of the file once it is called, mainly because the file might be **too large** to keep in main memory.
Thus, the `open` function takes roughly the same amount of time to execute regardless of the size of the file.

Once we call the `open` function, we get a file handler that can be used within a `for` loop to read each line of the file.
In this case, Python is in charge of splitting the content of the file into **separate lines**.
With the `for` loop we can efficiently read a file of any size, because each line is read, counted, and then discarded.

The following code creates a **file handler** (`file_logs`) and counts the number of lines in the file.

In [None]:
file_logs = open('datasets/logs.txt')
count: int = 0

for line in file_logs:
    count += 1
    
print(count)

If you know that the size of the file is **small** relative to the size of your main memory, you can use the `read` method on the file handler.
This method reads the whole content of the file as one **big string**, including all newline characters.

It is a good idea to store the output of the `read` method in a variable, because reading **exhausts** the file handler: once the content has been read, a later call to `read` returns an empty string.

In [None]:
file_logs = open('datasets/logs.txt')
first_call: str = file_logs.read()
second_call: str = file_logs.read()

print('First call:\n' + first_call)
print('Second call:\n' + second_call)

<div class="alert alert-success">
    <b>Do It Yourself!</b><br>
    Read the <i>logs.txt</i> file and print its lines one by one.
</div>

In [None]:
# Remove this line and add your code here

## 5. Searching through a File <a class="anchor" id="searching-file"></a>

It is quite common to search for specific or interesting lines in a file, and skip the ones that do not meet a given condition. 
For instance, we can use the `startswith` string method to print the lines that start with "INFO".

In [None]:
file_logs = open('datasets/logs.txt')

for line in file_logs:
    if line.startswith('INFO'):
        print(line)

We have filtered the information. 
But, why are we seeing the extra blank line between lines?

This is related to the invisible newline character present in all lines. Thus, the `print` function prints all lines with this newline character and it also adds an **additional newline character**, resulting in double spacing.

To improve the aesthetics of our output we can use the `rstrip` method, which strips whitespace (including the newline character) from the end of a string.

In [None]:
file_logs = open('datasets/logs.txt')

for line in file_logs:
    if line.startswith('INFO'):
        print(line.rstrip())
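
The invisible newline character can be made visible with the built-in `repr` function, which shows a string together with its escape sequences. The following sketch uses a made-up log line (it is not taken from `logs.txt`) to compare the string before and after `rstrip`. It also shows the `end` parameter of `print`, which is another possible way to avoid the double spacing.

```python
# A made-up log line, as it would come out of a file: it ends with '\n'
line: str = 'INFO server started\n'

print(repr(line))           # repr shows the trailing \n explicitly
print(repr(line.rstrip()))  # after rstrip the newline is gone

# print adds its own newline; end='' switches that off,
# so the line's own newline is the only one that is printed
print(line, end='')
```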

As your programs get more complex, you may want to use the `continue` statement to filter out uninteresting lines.

In [None]:
file_logs = open('datasets/logs.txt')

for line in file_logs:
    if not line.startswith('INFO'): 
        continue
    print(line.rstrip())

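In the previous code, the `continue` statement skips the rest of the loop body for every line that does not start with "INFO", so only the interesting lines reach the `print` call. For such a short body, the `if` statement can also be contracted by placing the `continue` statement on the same line as the condition.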

We can also use the `find` string method to look for a string in a given line of the file. 
This method returns the **position** of the string or `-1` if the string is not found.

Let us now print all lines that contain the text "from:bob@mail.nl".

In [None]:
file_logs = open('datasets/logs.txt')

for line in file_logs:
    if line.find('from:bob@mail.nl') == -1: 
        continue
    print(line.rstrip())
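
To see what `find` returns, it can help to call it on a single string outside the loop. The sketch below uses a made-up log line; the e-mail addresses are only for illustration. The `in` operator at the end is an alternative when only a yes/no answer is needed and the position does not matter.

```python
# Made-up log line for illustration
line: str = 'INFO message sent from:bob@mail.nl to:alice@mail.nl'

position: int = line.find('from:bob@mail.nl')
missing: int = line.find('from:carol@mail.nl')

print(position)  # index at which the match starts
print(missing)   # -1, because the text does not occur in the line

# The in operator answers the yes/no question directly
print('from:bob@mail.nl' in line)
```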

<div class="alert alert-success">
    <b>Do It Yourself!</b><br>
    Read the <i>logs.txt</i> file and count the ERROR lines.
</div>

In [None]:
# Remove this line and add your code here

## 6. File Names and Paths <a class="anchor" id="names-paths"></a>

Files are organized into **directories** (also called “folders”). 
Every running program has a **current directory**, which is the default directory for most operations. 
For example, when you open a file for reading, Python looks for it in the current directory.

The `os` module provides functions for working with files and directories (“os” stands for
“operating system”). 

`os.getcwd` returns the name of the current directory.

In [None]:
import os

cwd: str = os.getcwd()
print(cwd)

A string like `/path/to/user/path/to/course-material-jbi010/lectures/week3` that identifies a file or directory is called a **path**. 

<div class="alert alert-info">
    <b>Different path per OS</b><br>
    Every OS may have a different way of representing paths.
</div>

A simple filename, like `words.txt` is also considered a path, 
but it is a **relative path** because it relates to the current directory. 

For instance, if the current directory is `/path/to/user/path/to/course-material-jbi010/lectures/week3`, the filename
`output.txt` would refer to `/path/to/user/path/to/course-material-jbi010/lectures/week3/output.txt`.

A path that begins with `/` does not depend on the current directory; it is called an **absolute
path**. 
It shows the path from the root folder of your system.
To find the absolute path to a file, you can use `os.path.abspath`.

In [None]:
os.path.abspath('datasets/output.txt')
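
As a rough sketch of what `os.path.abspath` does for a relative path, the following cell rebuilds the same result by hand from the current directory. The variable names are made up for this illustration. Note that neither version checks whether the file actually exists.

```python
import os

relative: str = 'datasets/output.txt'
absolute: str = os.path.abspath(relative)

# Combine the current directory with the relative path
# and normalise the result (e.g. the separators)
by_hand: str = os.path.normpath(os.path.join(os.getcwd(), relative))

print(absolute)
print(os.path.isabs(relative))  # False
print(os.path.isabs(absolute))  # True
print(absolute == by_hand)
```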

`os.path` provides other functions for working with filenames and paths. 

`os.path.exists` checks whether a file or directory **exists**.

In [None]:
os.path.exists('datasets/output.txt')

If it exists, you can use `os.path.isdir` to check whether it **is a directory**.

<div class="alert alert-info">
    <b>Running cells with absolute paths</b><br>
    We use an absolute path in the next cells, please change accordingly if you want to run the cells.
</div>

In [None]:
os.path.isdir('datasets/output.txt')

In [None]:
os.path.isdir('/home/lina/Documents/tue/courses/jbi010/course-material-jbi010/202223/lectures/week3')

`os.listdir` returns a **list of the files** (and other directories) in the given directory.

In [None]:
os.listdir(cwd)

The application of these `os` functions is demonstrated in the following function, which iterates over a directory and prints all files and (sub)directories.

`os.path.join` takes a directory and a file name and joins them into a **complete path**.

In [None]:
def content(dir_name: str) -> None:
    """
    Prints all files and (sub)directories directly inside dir_name.
    :param dir_name: starting directory
    """
    current_directories: list[str] = os.listdir(dir_name)  # listdir returns a list of names
    
    for name in current_directories:
        path: str = os.path.join(dir_name, name)
        
        if os.path.isfile(path):
            print(f'File is: {path}')
        else:
            print(f'Path is: {path}')
            
content('/home/lina/Documents/tue/courses/jbi010/course-material-jbi010/202223/lectures')

<div class="alert alert-success">
    <b>Do It Yourself!</b><br>
    Call the <i>content()</i> function in one of the folders of your file system.
</div>

In [None]:
# Remove this line and add your code here

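The `os` module provides a function called `walk` that walks over a directory and all its subdirectories. For each directory it visits, it yields a tuple with the path of the directory, the names of its subdirectories, and the names of its files.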


In [None]:
for path in os.walk('/home/lina/Documents/tue/courses/jbi010/course-material-jbi010/202223/lectures'):
    print(path)
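
Each value produced by `os.walk` can be unpacked into its three parts. The following sketch builds a small throw-away directory tree first, so that it does not depend on any particular folder on your machine; all file and folder names in it are made up.

```python
import os
import shutil
import tempfile

# Build a small throw-away directory tree (all names are made up)
root: str = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'week3', 'datasets'))
for file_path in [os.path.join(root, 'notes.txt'),
                  os.path.join(root, 'week3', 'datasets', 'logs.txt')]:
    open(file_path, 'w').close()   # create an empty file

found: list[str] = []
# Each step yields a directory, its subdirectory names, and its file names
for dir_path, dir_names, file_names in os.walk(root):
    for name in file_names:
        full_path: str = os.path.join(dir_path, name)
        found.append(os.path.relpath(full_path, root))

shutil.rmtree(root)                # remove the throw-away tree again
print(sorted(found))
```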

<div class="alert alert-success">
    <b>Do It Yourself!</b><br>
    Use the <i>walk()</i> function to check the content of a folder in your file system.
</div>

In [None]:
# Remove this line and add your code here

## 7. Let the User Choose the File <a class="anchor" id="choose-file"></a>

Sometimes we would like to let the user **choose the file** they want to open. In this way, we do not need to modify our code every time we require a different file.

To do so, we can use the `input` function as follows.

In [None]:
file: str = input('Enter the file name:')
file_handle = open(file)  # open returns a file handler, not a string

print(file)

But what can go wrong with this code?

### Catching Exceptions

It might happen that we or our users enter a wrong path to a file, which will result in an **exception**.

In general, the reading and writing of files is error-prone. 
The file may not exist, it may be read or write protected, etc.
If you try to open a file that does not exist, you get a `FileNotFoundError`.

In [None]:
fbad = open('bad.txt')

If you are not allowed to open a file, you get a `PermissionError`.

In [None]:
file_out = open('/etc/passwd', 'w')

If you try to open a directory, you get an `IsADirectoryError` on Unix-like systems (provided the directory exists; otherwise you get another `FileNotFoundError`).

In [None]:
file_in = open('/users')

You could use functions like `os.path.exists` and `os.path.isfile` to prevent these types of errors.

However, a lot of (subtle) errors may happen when doing file I/O, and thus a lot of code
may be involved to make it foolproof.

If “`Errno 21`” is any indication, there are at least 21 things that can go wrong.
It is better to go ahead and try—and deal with problems if they happen—which is exactly
what the `try` statement does. 

In [None]:
try:
    fbad = open('bad_file')  # open returns a file handler, not a string
except FileNotFoundError:
    print("Didn't find the file!")

Python starts by executing the `try` clause. 
If all goes well, it skips the `except` clause and proceeds. 
If an exception occurs, it jumps out of the `try` clause and executes the `except` clause.

Remember that handling an exception with a `try` statement is called **catching an exception**.  
In general, catching an exception gives you a chance to fix the problem, or try again, or at least end the program gracefully.
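
A `try` statement can have several `except` clauses, one per kind of problem. The sketch below wraps this in a hypothetical helper function, `describe_open` (the name and the returned messages are made up for this illustration), that reports what happened instead of stopping the program. The more specific exceptions are listed first, because `FileNotFoundError` and `PermissionError` are both special cases of `OSError`.

```python
import tempfile

def describe_open(path: str) -> str:
    """
    Tries to open path for reading and reports what happened.
    :param path: file to open
    :return: a short status message
    """
    try:
        handle = open(path)
    except FileNotFoundError:
        return 'not found'
    except PermissionError:
        return 'not allowed'
    except OSError as e:  # any other OS-level problem, e.g. a directory
        return f'other problem: {type(e).__name__}'
    handle.close()
    return 'ok'

print(describe_open('bad_file_that_does_not_exist.txt'))
print(describe_open(tempfile.gettempdir()))  # a directory, not a file
```

Which message the second call prints depends on the operating system, since not every OS reports an attempt to open a directory in the same way.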

<div class="alert alert-success">
    <b>Do It Yourself!</b><br>
    Handle the exception in the following code.
</div>

In [None]:
# Modify the following code
file = input('Enter the file name:')
file_handle = open(file)

print(file)

## 8. Writing Files <a class="anchor" id="writing-files"></a>

To write a file you should open it with mode `w` (from *write*) as the second argument.

In [None]:
file_out = open('datasets/output.txt', 'w')

If the file already exists, opening it in write mode removes the current content from the file, *so be careful!* 
If the file does not exist, a new one is created.

`open` returns a file object that provides methods (`write` and `close`) for working with the file. 
The `write` method puts data into the file.

In [None]:
line1: str = 'There are bright Data Scientists\n'
file_out.write(line1)

The output number indicates how many characters are written to the file.

You need to explicitly add newline characters when using the `write` method. Contrary to the `print` function, the `write` method does not automatically add newlines at the end of strings.

The file object keeps track of the position where the next data is to be written.
If you write more data to the file, it will be **appended** to what you have written so far.

In [None]:
line2: str = 'But there are also bright Computer Scientists\n'
file_out.write(line2)

When you are done writing, you must `close` the file. 
Writing a file is what we know as a **buffered operation**: the data you write is first stored in a **buffer**, a memory region used to hold temporary data that is being moved from one place to another.
There is no guarantee that the data has actually reached the file until the buffer is flushed, which happens at the latest when you close the file.
If you do not close the file, you will keep using unnecessary memory.
When your program terminates, it will try to close the file, but beware, part of your data might be lost!

<div class="alert alert-info">
    <b>Close files as a good practice</b><br>
    Similar to type hints and comments, make it common practice to <b>close files explicitly</b>, because different Python environments may have different behaviour in this respect.
</div>

In [None]:
file_out.close()
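
The whole cycle of opening, writing, closing, and reading back can be sketched in one cell. To avoid overwriting a real file, the sketch below works in a throw-away directory created with the `tempfile` module; the directory and the text written are made up for this illustration.

```python
import os
import shutil
import tempfile

# Work in a throw-away directory so no real file is overwritten
folder: str = tempfile.mkdtemp()
path: str = os.path.join(folder, 'output.txt')

file_out = open(path, 'w')
written: int = file_out.write('first line\n')  # write returns the number of characters
written += file_out.write('second line\n')     # continues after the first write
file_out.close()                               # flushes the buffer to the file

file_in = open(path)
content: str = file_in.read()
file_in.close()

shutil.rmtree(folder)                          # remove the throw-away directory
print(written)
print(content)
```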

<div class="alert alert-success">
    <b>Do It Yourself!</b><br>
    Copy the content of <i>logs.txt</i> in the <i>output.txt</i> file.
</div>

In [None]:
# Remove this line and add your code here

## 9. The `with` Statement <a class="anchor" id="with"></a>

In the previous section, we saw that we need to explicitly close a file; otherwise we keep using memory that we no longer need.
This problem is known as a **memory leak**: the available memory of your computer decreases every time you open a new resource without closing it.
However, remembering every time that you need to close a file is error-prone.

Python offers the `with` statement as a way to cope with human errors related to the handling of these resources (e.g. files, network connections).
This statement generates a temporary context (called **runtime context**) that runs all the statements defined in its body.
In addition, a **context manager** will take care of the resource handling.

The syntax of this statement is as follows:

```python
with expression as ctx_manager:
    # Body
```

The context manager (`ctx_manager`) results from evaluating the `with` expression.
It is in essence an *object* that implements the **context management protocol**.
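
In Python, the context management protocol consists of two methods: `__enter__`, which runs before the body, and `__exit__`, which runs after it. The following minimal sketch defines a hypothetical class, `Announcer` (made up for this illustration), that only records when these methods are called.

```python
class Announcer:
    """Hypothetical context manager that records when it is entered and exited."""

    def __init__(self) -> None:
        self.events: list[str] = []

    def __enter__(self) -> 'Announcer':
        # Runs before the body; the return value is bound to the name after `as`
        self.events.append('enter')
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> bool:
        # Runs after the body, even if the body raised an exception
        self.events.append('exit')
        return False  # do not suppress exceptions

with Announcer() as ctx_manager:
    ctx_manager.events.append('body')

print(ctx_manager.events)  # ['enter', 'body', 'exit']
```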

<div class="alert alert-info">
    <b>More about context managers</b><br>
    If you want to know more about <b>context managers</b> and their protocol, please visit <a href="https://book.pythontips.com/en/latest/context_managers.html">this link</a>.
</div>

We can change the way we open our files as follows.

In [None]:
with open('datasets/output.txt', 'w') as file_out:
    line1: str = 'There are bright Data Scientists\n'
    line2: str = 'But there are also bright Computer Scientists\n'
    file_out.write(line1)
    file_out.write(line2)

Notice that the `open('datasets/output.txt', 'w')` expression creates a context manager (i.e. our file handler).
The output of this expression is saved in the variable `file_out`.
Afterward, we encapsulate all the file-management related statements within the body of the `with` statement.
The important part now is that we don't need to call the `close` method anymore!
The context manager will take care of that for us.
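
One way to check that the file really is closed is the `closed` attribute of the file handler. The sketch below writes to a throw-away file (the path is created with `tempfile` and is made up for this illustration) and also shows that the file is closed when the body stops with an exception.

```python
import os
import shutil
import tempfile

folder: str = tempfile.mkdtemp()
path: str = os.path.join(folder, 'output.txt')

with open(path, 'w') as file_out:
    file_out.write('inside the body\n')
    print(file_out.closed)  # False: still open inside the body

print(file_out.closed)      # True: closed when the body ended

# The file is also closed when the body stops with an exception
try:
    with open(path) as file_in:
        raise ValueError('something went wrong in the body')
except ValueError:
    print(file_in.closed)   # True

shutil.rmtree(folder)       # remove the throw-away directory
```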

Using the `with` statement can actually make your code more **readable and safer**.

## 10. File Formats <a class="anchor" id="file-formats"></a>

There are a plethora of formats used to store data. Some of the most popular formats are *plain text (txt)*, *Comma-Separated Values (CSV)*, *JavaScript Object Notation (JSON)*, and *Extensible Markup Language (XML)*. 

### Plain Text
So far, we have been interacting with **plain text** files (with the `.txt` extension), representing characters written using a specific encoding such as ASCII or Unicode.
An **encoding** is a system that represents each character as a number for digital representation. 
(Have a look at the `logs.txt` file.)

In [None]:
# Read the first line of the logs.txt file

with open('datasets/logs.txt') as file:
    print(file.readline()) 

### Comma-Separated Values (CSV)

The **Comma-Separated Values (CSV)** format–with the `.csv` extension–uses a comma, semicolon, or another character to separate values in a file. 
Each line in the file reports a case or record, and each value within the record is an observation for a given variable. 
Each record should have the same number of observations. 
CSV files usually store rectangular data in plain text and they look as follows.
(Also have a look at the `logs.csv` file.)

```
variable_1, variable_2, ..., variable_n
row_1_value_1, row_1_value_2, ..., row_1_value_n
row_2_value_1, row_2_value_2, ..., row_2_value_n
row_3_value_1, row_3_value_2, ..., row_3_value_n
```

In Python, we can use the `csv` library to interact with CSV files. 
We can use the `reader` function to create a reader object that will iterate over the rows in the CSV file. 
We can access each value using an index that is based on the position of the variable in the file. 
Notice that the header of the data file will also be printed if you do not exclude it explicitly.

In [None]:
import csv

with open('datasets/logs.csv') as file:
    reader = csv.reader(file, delimiter=',') # Default value of delimiter is ','
    
    for row in reader:
        typ = row[0]        # We write typ to avoid shadowing the built-in type
        timestamp = row[1]
        from_email = row[2]
        to_email = row[3]
        message = row[4]
        
        print(f'{typ} {message} [{timestamp}] from:{from_email} to:{to_email}')
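
One possible way to exclude the header is to consume the first row with the built-in `next` function before the loop starts. The sketch below does not read `logs.csv`; it uses made-up records in the same column order, wrapped in `io.StringIO` so that the text behaves like an opened file.

```python
import csv
import io

# Made-up data in the same column order as used above
text: str = (
    'type,timestamp,from,to,message\n'
    'INFO,2022-09-01 10:00,bob@mail.nl,alice@mail.nl,Hello\n'
    'ERROR,2022-09-01 10:05,alice@mail.nl,bob@mail.nl,Delivery failed\n'
)

reader = csv.reader(io.StringIO(text))
header: list[str] = next(reader)      # consume the first row, i.e. the header

rows: list[list[str]] = list(reader)  # only the data rows are left
print(header)
print(len(rows))
```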

However, indexing is error-prone. 
We can instead use the `DictReader` class, which creates a dictionary for each record in the file. 
You can access specific values with the name of the variable, as declared in the first row of the file. 
In this case, you do not need to manage the header row; the `csv` library will do it for you.

In [None]:
import csv

with open('datasets/logs.csv') as file:
    reader = csv.DictReader(file)
    
    for row in reader:
        typ = row['type']            # We write typ to avoid shadowing the built-in type
        timestamp = row['timestamp']
        from_email = row['from']
        to_email = row['to']
        message = row['message']
        
        print(f'{typ} {message} [{timestamp}] from:{from_email} to:{to_email}')

### JavaScript Object Notation (JSON)

The **JavaScript Object Notation (JSON)** is a file format–with the `.json` extension–used to store data objects represented as attribute-value pairs. 
We can think of it as an object that usually contains data in the form of lists and dictionaries. 
This format is especially popular when there is data transfer between web applications and servers. 
JSON files look as follows. 
(Also have a look at the `logs.json` file.)

```
{
    "key_1" : {
        "key_1_1" : "value_1",
        "key_1_2" : "value_2",
        "key_1_3" : [...],
        ...
    },
    ...
}
```

JSON supports the following structures:

* **Object:** unordered set of name-value pairs. An object starts with `{` and ends with `}` (similar to a dictionary in Python).
* **Array:** sequence of values. An array begins with `[` and ends with `]` (similar to a list in Python). 
* **Value:** can be a string (in double quotes), a number, `true`, `false`, `null`, an object, or an array. 

In Python, we can use the `json` library to interact with JSON files.
We can use the `load` function which creates a Python object based on [this conversion table](https://docs.python.org/3/library/json.html#json-to-py-table). 
You can then interact with the data as you normally do in Python.

In [None]:
import json

with open('datasets/logs.json') as file:
    data = json.load(file)  # We get a list of dictionaries out of the JSON file
    
    for obj in data:                 # Iterate over all objects (dictionaries) in the array (list)
        typ = obj['type']            # We write typ to avoid shadowing the built-in type
        timestamp = obj['timestamp']
        from_email = obj['from']
        to_email = obj['to']
        message = obj['message']
        
        print(f'{typ} {message} [{timestamp}] from:{from_email} to:{to_email}')
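
The conversion between JSON values and Python objects can be inspected with `json.loads`, which parses a string instead of a file. The JSON text in the sketch below is made up for this illustration and is not the content of `logs.json`.

```python
import json

# Made-up JSON text that uses every kind of value listed above
text: str = '{"name": "logs", "count": 2, "ratio": 0.5, "ok": true, "errors": null, "tags": ["a", "b"]}'

data = json.loads(text)  # loads parses a string; load reads from a file

# Print each key with the Python type its value was converted to
for key, value in data.items():
    print(key, type(value).__name__, value)
```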

### Extensible Markup Language (XML)

The Extensible Markup Language (XML) is a markup language and data format used to represent and store arbitrary data structures. 
The XML language was formalized in the World Wide Web Consortium's XML 1.0 Specification of 1998. 
It consists of a set of structured *tags*, each one with zero or more *attributes*, and an *element* (characters in between the opening and ending tags). 
XML files look as follows. (Also have a look at the `logs.xml` file.)

```
<tag_1>
    <tag_1_1 attr_1="value_1" attr_2="value_2" ...>
        element_1_1
    </tag_1_1>
    <tag_1_2>
        element_1_2
    </tag_1_2>
    ...
</tag_1>
```

In Python, we can use the `xml.etree.ElementTree` library to interact with XML files. 
We use the `parse` function to read and parse an XML file. 
The output of the function is a tree. 
Trees are out of the scope of this course, but you should know that they are data structures frequently used in computer science and data science. 
You can traverse the tree to get relevant information from each of its nodes. 
In the following cell, we show a very simple interaction with the tree structure.

In [None]:
import xml.etree.ElementTree as ET

tree = ET.parse('datasets/logs.xml')
root = tree.getroot()                # Get the root node of the tree

for message in root.iter('message'): # Get all tags of type 'message'
    print(message.text)              # and print their text.
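
Besides the text, each node of the tree also gives access to its tag name and its attributes. The sketch below parses a made-up XML string with `ET.fromstring`; the tag and attribute names are assumptions for this illustration, and the real `logs.xml` may be structured differently.

```python
import xml.etree.ElementTree as ET

# Made-up XML; the real logs.xml may be structured differently
text: str = """
<logs>
    <log type="INFO">
        <message>Hello</message>
    </log>
    <log type="ERROR">
        <message>Delivery failed</message>
    </log>
</logs>
"""

root = ET.fromstring(text.strip())  # parse from a string instead of a file

for log in root.iter('log'):
    # attrib is a dictionary with the attributes of the tag
    print(log.tag, log.attrib['type'], log.find('message').text)
```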

---
This Jupyter Notebook is based on Chapter 7 of the book Python for Everybody and Chapter 14 of the book Think Python.

---

# (End of Notebook)

&copy; 2022-2023 - **TU/e** - Eindhoven University of Technology