# Welcome to File handling course notes:

This notebook covers different ways to handle files in Python.

## Introduction: Working with Data files:

In Python, we must open files before we can use them and close them when we are done with them. As you might expect, once a file is opened it becomes a Python object just like all other data.
<br>Below are some functions and methods that can be used to open and close files:

1. open for reading:<br>
<b>Usage:</b> ``fileref = open(filename, 'r')``<br>
Open a file called filename and use it for reading. This will return a reference to a file object (``fileref`` here).
2. open for writing:<br>
<b>Usage:</b> ``fileref = open(filename, 'w')``<br>
Open a file called filename and use it for writing. This will also return a reference to a file object (``fileref`` here).
3. close:<br>
<b>Usage:</b> ``fileref.close()``<br>
Close the file once you are done using it.

As an example, suppose we have a text file called ``olympics.txt`` that contains data about Olympians across different years.

In [9]:
# We first open the olympics.txt file using open; this returns a reference to a file object, which we name fileref
fileref = open('../RefFiles/olympics.txt', 'r')

# We can now perform file read related operations within this block.

# Once we are finished with the file usage, we would close the file using the close method. 
fileref.close()

# After this point, any attempt to read from (or write to) fileref will raise an error.
print(fileref.read())

ValueError: I/O operation on closed file.

Once you have a file “object”, the thing returned by the open function, Python provides three methods to read data from that object:<br>
<strong>1. read()</strong>:<br>
<strong><i>Usage:</i></strong> ``fileref.read(n)``<br>
Read and return a string of ``n`` characters, or the entire file as a single string if ``n`` is not provided.<br>

<strong>2. readline()</strong>:<br>
<strong><i>Usage:</i></strong> ``fileref.readline(n)``<br>
Read and return the next line of the file with all text up to and including the newline character. If ``n`` is provided as a parameter, then only ``n`` characters will be returned if the line is longer than ``n``.<br>
<b>Note:</b> the parameter ``n`` is not supported in the browser version of Python and is rarely used in practice, so you can safely ignore it.<br>

<strong>3. readlines()</strong>:<br>
<strong><i>Usage:</i></strong> ``fileref.readlines(n)``<br>
Returns a list of strings, each representing a single line of the file. If ``n`` is not provided then all lines of the file are returned. If ``n`` is provided then ``n`` characters are read but ``n`` is rounded up so that an entire line is returned.<br>
<b>Note:</b> like readline, readlines ignores the parameter ``n`` in the browser.<br>


The strings returned by methods \#2 and \#3 may contain a newline character (\n) at the end.
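
The optional ``n`` parameter described above can be tried out with a short sketch. The temporary file and its three rows are assumptions made only so the example is self-contained; they are not part of the course files.

```python
import os
import tempfile

# Hypothetical sample data, written to a temporary file for illustration only.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
fileref = open(path, "w")
fileref.write("Name,Sex,Age\nA Dijiang,M,24\nA Lamusi,M,23\n")
fileref.close()

fileref = open(path, "r")
first_four = fileref.read(4)       # at most 4 characters from the current position
rest_of_line = fileref.readline()  # continues where read(4) stopped, up to and including \n
partial = fileref.readline(3)      # at most 3 characters of the next line
fileref.close()

print(repr(first_four))    # 'Name'
print(repr(rest_of_line))  # ',Sex,Age\n'
print(repr(partial))       # 'A D'
```

Each call picks up where the previous one stopped, which is why ``readline()`` returns only the remainder of the first line here.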

<strong> Important Note:</strong>
A common error that novice programmers make is not realizing that all of these ways of reading the file contents use up the file.<br>
After you call ``readlines()``, calling it again returns an empty list.
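
A minimal sketch of this behavior, assuming a small temporary file (hypothetical, created here only so the example runs on its own):

```python
import os
import tempfile

# Hypothetical two-line file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
fileref = open(path, "w")
fileref.write("Name,Medal\nA Dijiang,NA\n")
fileref.close()

fileref = open(path, "r")
first_call = fileref.readlines()   # reads every line and leaves the position at the end of the file
second_call = fileref.readlines()  # nothing is left to read
fileref.close()

print(first_call)   # ['Name,Medal\n', 'A Dijiang,NA\n']
print(second_call)  # []
```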

In [10]:
# Let's try and use these methods one by one
fref = open('../RefFiles/olympics_short.txt', 'r')
contents = fref.read()                   # Should save all the lines from the file in the string: contents
fref.close()
print("read():\ncontents = ({})".format(contents))

# We need to open the file again before reading: read() has already consumed the contents, and the file is also closed.
fref = open('../RefFiles/olympics_short.txt', 'r')
all_lines = fref.readlines()            # Should save all the lines in the file in the list: all_lines
fref.close()
print("\nreadlines():\nlines = {}".format(all_lines))

# Once again, we would reopen as the contents have been used already by readlines() and the file is also closed.
fref = open('../RefFiles/olympics_short.txt', 'r')
line1 = fref.readline()
line2 = fref.readline()
line3 = fref.readline()
fref.close()

print("\nreadline():\nline1 = ({}),\nline2 = ({}),\nline3 = ({})".format(line1, line2, line3))    # See how the \n chars are being parsed here.

read():
contents = (Name,Sex,Age,Team,Event,Medal
A Dijiang,M,24,China,Basketball,NA
A Lamusi,M,23,China,Judo,NA
Gunnar Nielsen Aaby,M,24,Denmark,Football,NA
Edgar Lindenau Aabye,M,34,Denmark/Sweden,Tug-Of-War,Gold
Christine Jacoba Aaftink,F,21,Netherlands,Speed Skating,NA
)

readlines():
lines = ['Name,Sex,Age,Team,Event,Medal\n', 'A Dijiang,M,24,China,Basketball,NA\n', 'A Lamusi,M,23,China,Judo,NA\n', 'Gunnar Nielsen Aaby,M,24,Denmark,Football,NA\n', 'Edgar Lindenau Aabye,M,34,Denmark/Sweden,Tug-Of-War,Gold\n', 'Christine Jacoba Aaftink,F,21,Netherlands,Speed Skating,NA\n']

readline():
line1 = (Name,Sex,Age,Team,Event,Medal
),
line2 = (A Dijiang,M,24,China,Basketball,NA
),
line3 = (A Lamusi,M,23,China,Judo,NA
)


We can also write to a file by opening it with the ``'w'`` mode. To write a line, we use the ``write()`` method on the reference object that refers to a file opened for writing. When we open a file for writing, a new, empty file with that name is created and made ready to accept our data. If an existing file has the same name, its contents are overwritten.

<strong>write()</strong>:<br>
<strong><i>Usage:</i></strong> ``fileref.write(astring)``<br>
Add a string to the end of the file. ``fileref`` must refer to a file that has been opened for writing (using the 'w' parameter instead of 'r').<br>

Be very careful to notice that the write method takes one parameter, a string. It is the programmer’s job to include the newline characters as part of the string if desired.
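
As a rough illustration of both points (newlines are the programmer's job, and ``'w'`` overwrites existing contents), here is a small sketch. The file name ``notes.txt`` and the temporary directory are assumptions chosen for the example.

```python
import os
import tempfile

# Hypothetical output location, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "notes.txt")

outfile = open(path, "w")
outfile.write("first line\n")  # the newline must be written explicitly
outfile.write("second ")       # no newline here, so the next write continues on the same line
outfile.write("line\n")
outfile.close()

infile = open(path, "r")
written = infile.read()
infile.close()
print(repr(written))           # 'first line\nsecond line\n'

# Opening the same file with 'w' again starts from an empty file.
outfile = open(path, "w")
outfile.write("replaced\n")
outfile.close()

infile = open(path, "r")
after_reopen = infile.read()
infile.close()
print(repr(after_reopen))      # 'replaced\n'
```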

## Iterating over files: 
A <strong>line</strong> of a file is defined to be a sequence of characters up to and including a special newline character (\n). If you print a string that contains a newline you will not see the \n, you will just see its effect (a line break).

Because ``readlines()`` returns a list of lines of text, we can use the for loop to iterate through each line of the file.<br>
```python
fileref = open('filename.ext', 'r')
for line in fileref.readlines():
    statement1
    statement2
    ...
```

However, to make the code a little simpler, and to allow for more efficient processing, Python provides a built-in way to iterate through the contents of a file one line at a time, without first reading them all into a list.

```python
fileref = open('filename.ext', 'r')
for line in fileref:
    statement1
    statement2
    ...
```

## Using ```with``` for files:
This is another mechanism that Python provides for us that cleans up the often forgotten close.
Forgetting to close a file does not necessarily cause a runtime error. However, if you are writing a program that may run for days or weeks at a time and does a lot of file reading and writing, you may run into trouble.

We can use the ``with`` statement to make context management easy (opening and closing files):

```python
with <create some object that understands context> as <some name>:
    # do some stuff with the object
    ...
```
When the program exits the with block, the context manager handles the common stuff that normally happens at the end, in our case closing a file. Let's see an example below:

In [11]:
with open('../RefFiles/olympics_short.txt', 'r') as md:
    for line in md:
        print(line.strip())        # We use the strip() method to remove the trailing newline, which would otherwise print an unnecessary blank line.
# continue on with other code

Name,Sex,Age,Team,Event,Medal
A Dijiang,M,24,China,Basketball,NA
A Lamusi,M,23,China,Judo,NA
Gunnar Nielsen Aaby,M,24,Denmark,Football,NA
Edgar Lindenau Aabye,M,34,Denmark/Sweden,Tug-Of-War,Gold
Christine Jacoba Aaftink,F,21,Netherlands,Speed Skating,NA


### Here’s a foolproof recipe for processing the contents of a text file.
#### <b><u>Steps:</u></b>
1. Open the file using ``with`` and ``open``.
2. Use ``.readlines()`` to get a list of the lines of text in the file.
3. Use a ``for`` loop to iterate through the strings in the list, each being one line from the file. On each iteration, process that line of text.
4. When you are done extracting data from the file, continue writing your code outside of the indentation. Using with will automatically close the file once the program exits the with block.

```python
fname = "yourfile.txt"
with open(fname, 'r') as fileref:         # step 1
    lines = fileref.readlines()           # step 2
    for lin in lines:                     # step 3
        pass                              # replace with some code that references the variable lin
#some other code not relying on fileref   # step 4
```

However, this recipe is not a good fit for large data, because ``readlines()`` loads every line into memory before the loop starts. Imagine working with a data file that has millions of rows: reading it all in takes time and memory, and iterating over the resulting list takes more time still.<br>
In that case, we iterate over the file object itself, which still gives us one line per iteration:

```python
with open(fname, 'r') as fileref:
    for lin in fileref:
        # some code that uses lin as the current line of the file
        pass  # placeholder so the loop runs as written; replace with your processing code
```
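
A runnable version of this pattern might look like the sketch below. It assumes a temporary file holding the same six rows shown earlier for ``olympics_short.txt``; the temporary path is hypothetical. Only one line is held in memory at a time.

```python
import os
import tempfile

# Rows copied from the olympics_short.txt output shown earlier; the path is hypothetical.
rows = [
    "Name,Sex,Age,Team,Event,Medal",
    "A Dijiang,M,24,China,Basketball,NA",
    "A Lamusi,M,23,China,Judo,NA",
    "Gunnar Nielsen Aaby,M,24,Denmark,Football,NA",
    "Edgar Lindenau Aabye,M,34,Denmark/Sweden,Tug-Of-War,Gold",
    "Christine Jacoba Aaftink,F,21,Netherlands,Speed Skating,NA",
]
fname = os.path.join(tempfile.mkdtemp(), "olympics_short.txt")
with open(fname, "w") as out:
    for row in rows:
        out.write(row + "\n")

medal_count = 0
with open(fname, "r") as fileref:
    fileref.readline()          # read and discard the header line
    for lin in fileref:         # the loop continues from the second line
        vals = lin.strip().split(",")
        if vals[5] != "NA":
            medal_count += 1

print(medal_count)  # 1
```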

## Reading from CSV Files:

Typically, CSV files will have a header as the first line, which contains column names. Then, each following row in the file will contain data that corresponds to the appropriate columns.<br>
The ``read``, ``readline``, and ``readlines`` methods, as well as simply iterating over the file object itself, all work on CSV files.<br>
The comma is the delimiter in CSV files, so we use the ``split()`` method to split each line on commas.

In [12]:
fref = open("../RefFiles/olympics.txt", 'r')

# Getting the lines inside a list, then closing the file since everything we need is now in memory
lines = fref.readlines()
fref.close()

# Capturing the header from the data
header = lines[0]

# Stripping \n and splitting the values based on comma
field_names = header.strip().split(',')
print(field_names)
for row in lines[1:]:
    vals = row.strip().split(',')
    # Ignoring players who have not won any medals
    if vals[5] != "NA":
        print("{}: {}; {}".format(
                vals[0],
                vals[4],
                vals[5]))

['Name', 'Sex', 'Age', 'Team', 'Event', 'Medal']
Edgar Lindenau Aabye: Tug-Of-War; Gold
Arvo Ossian Aaltonen: Swimming; Bronze
Arvo Ossian Aaltonen: Swimming; Bronze
Juhamatti Tapio Aaltonen: Ice Hockey; Bronze
Paavo Johannes Aaltonen: Gymnastics; Bronze
Paavo Johannes Aaltonen: Gymnastics; Gold
Paavo Johannes Aaltonen: Gymnastics; Gold
Paavo Johannes Aaltonen: Gymnastics; Gold
Paavo Johannes Aaltonen: Gymnastics; Bronze


Note that the trick of splitting the text for each row based on the presence of commas only works because commas are not used in any of the field values. It would break if, for example, a sport were recorded as ``Swimming, 100M Freestyle``.<br>
One alternative format uses a different column separator, such as ``|`` or a tab (``\t``). When a tab is used, the format is sometimes called TSV, for tab-separated values. If you get a file using a different separator, you can just call ``.split('|')`` or ``.split('\t')``.<br>
Another, more advanced CSV format separates values with commas but encloses all values in double quotes. In such a case, it is better to use Python's built-in ``csv`` module.
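
As a minimal sketch of that last point, the standard library's ``csv.reader`` can parse quoted values that contain commas, where a plain ``split(',')`` cannot. The temporary file and its two rows are assumptions made for illustration.

```python
import csv
import os
import tempfile

# Hypothetical quoted CSV file; note the comma inside the last field.
path = os.path.join(tempfile.mkdtemp(), "quoted.csv")
data_line = '"John Aalberg","31","Cross Country Skiing, 15KM"'
with open(path, "w", newline="") as out:
    out.write('"Name","Age","Sport"\n')
    out.write(data_line + "\n")

# A plain split breaks the last field in two and keeps the quote characters.
naive = data_line.split(",")
print(len(naive))  # 4

# csv.reader understands the quoting and returns three clean fields per row.
with open(path, "r", newline="") as infile:
    parsed = list(csv.reader(infile))

print(parsed[1])  # ['John Aalberg', '31', 'Cross Country Skiing, 15KM']
```

Passing ``newline=""`` to ``open`` is what the ``csv`` module documentation recommends for files handed to ``csv.reader`` or ``csv.writer``.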

## Writing data to a CSV File:
The typical pattern for writing data to a CSV file is to write a header row and loop through the items in a list, outputting one row for each.

Here is an example where we have a list of tuples, each representing one Olympian, with a subset of the rows and columns from the file we have been reading from.<br>

```python 
olympians = [("John Aalberg", 31, "Cross Country Skiing"),
             ("Minna Maarit Aalto", 30, "Sailing"),
             ("Win Valdemar Aaltonen", 54, "Art Competitions"),
             ("Wakako Abe", 18, "Cycling")]

outfile = open("reduced_olympics.csv", "w")
# output the header row
outfile.write('Name,Age,Sport')
outfile.write('\n')
# output each of the rows:
for olympian in olympians:
    row_string = '{},{},{}'.format(olympian[0], olympian[1], olympian[2])
    outfile.write(row_string)
    outfile.write('\n')
outfile.close()
```

Here we use the ``.format()`` method to create a CSV-formatted string. An alternative, equally clear way to do it is the ``.join`` method. However, ``.join`` only accepts strings, so any non-string values must be converted with ``str()`` first, whereas ``.format()`` handles that conversion for you.
```python
row_string = ','.join([olympian[0], str(olympian[1]), olympian[2]])
```

By contrast, just writing ``.format(olympian)`` wouldn’t work, because the interpreter would see only one value (a tuple) when it was expecting three values to substitute into the string template.<br>
Also, when the values themselves need to be wrapped in double quotes, recall that Python allows strings to be delimited with either single quotes or double quotes, so one can delimit the string while the other appears as a character inside it.<br>
But if you need to quote all or most of the values, it is best to use the ``csv`` module instead.
```python
olympians = [("John Aalberg", 31, "Cross Country Skiing, 15KM"),
             ("Minna Maarit Aalto", 30, "Sailing"),
             ("Win Valdemar Aaltonen", 54, "Art Competitions"),
             ("Wakako Abe", 18, "Cycling")]

outfile = open("reduced_olympics2.csv", "w")
# output the header row
outfile.write('"Name","Age","Sport"')
outfile.write('\n')
# output each of the rows:
for olympian in olympians:
    # We used double quotes here to quote the values; no spaces follow the commas, matching the header row.
    row_string = '"{}","{}","{}"'.format(olympian[0], olympian[1], olympian[2])
    outfile.write(row_string)
    outfile.write('\n')
outfile.close()
```
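
For comparison, here is a sketch of the same output produced with the ``csv`` module, which takes care of the quoting. It is one possible approach rather than the only one; the output path is a hypothetical temporary file, and ``csv.QUOTE_ALL`` is chosen to mirror the fully quoted format used above.

```python
import csv
import os
import tempfile

olympians = [("John Aalberg", 31, "Cross Country Skiing, 15KM"),
             ("Minna Maarit Aalto", 30, "Sailing"),
             ("Win Valdemar Aaltonen", 54, "Art Competitions"),
             ("Wakako Abe", 18, "Cycling")]

# Hypothetical output location, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "reduced_olympics2.csv")

with open(path, "w", newline="") as outfile:
    writer = csv.writer(outfile, quoting=csv.QUOTE_ALL)  # wrap every value in double quotes
    writer.writerow(["Name", "Age", "Sport"])            # header row
    writer.writerows(olympians)                          # one row per tuple

# Reading the file back shows that the embedded comma survived the round trip.
with open(path, "r", newline="") as infile:
    round_trip = list(csv.reader(infile))

print(round_trip[1])  # ['John Aalberg', '31', 'Cross Country Skiing, 15KM']
```

Note that the ages come back as strings: CSV stores only text, so numeric columns need to be converted again after reading.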