<div style="text-align: right">
    <i>
        LIN 537: Computational Linguistics 1 <br>
        Fall 2019 <br>
        Alëna Aksënova
    </i>
</div>

# Notebook 6: File IO

This notebook shows how to read data from and write data to external files such as `.txt` or `.csv`.

## Opening files

There are two files in the folder `files`: `novartis_microsoft.txt` and `grades.csv`. To open or create a file, we will use the following syntax:

    with open(path_to_file, mode) as name_of_open_file:
        # code where the open file is referred to as name_of_open_file
        
`path_to_file` is a string that points to the file that we want to open or create. The current notebook is in the `notebooks` repository, so to give the address of, for example, `novartis_microsoft.txt`, we need to provide the following path relative to the notebook: `'files/novartis_microsoft.txt'`.

`mode` is a string that defines the mode in which you are going to work with the file. The main modes are the following ones:
  * `'r'` (read): in this case we expect the file with the indicated name to already exist, and we are going to read the file line by line, where lines are separated from each other by a new line character;
  * `'w'` (write): opening a file in the writing mode will _create_ that file on the computer and will allow us to write strings into it. If a file with that name already exists, its old content is erased;
  * `'a'` (append): opens an already existing file and allows us to add new lines to the end of that file. If the file does not exist yet, it is created.
  
There are many other modes in which it is possible to open a file, but you can read about them on your own [here](https://stackabuse.com/file-handling-in-python/).
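The three modes above can be compared in a minimal sketch. It assumes a throwaway file called `demo.txt` inside a temporary folder (both are made up for this illustration), and it uses the file methods `write` and `read`, which write one string and read the whole file as one string, respectively.

```python
import os
import tempfile

# Work in a temporary folder so that no real files are touched.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "demo.txt")  # hypothetical file name

    # 'w' creates the file (or empties it if it already exists).
    with open(path, "w") as f:
        f.write("first line\n")

    # 'a' keeps the existing content and adds to the end of it.
    with open(path, "a") as f:
        f.write("second line\n")

    # 'r' reads the file, which exists by now.
    with open(path, "r") as f:
        content = f.read()

print(content)
```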

In [None]:
with open('files/novartis_microsoft.txt', 'r') as file:
    for line in file:
        print(line)

The variable `file` is a name for the .txt file when it is loaded in memory and ready to be processed. Its type is `<class '_io.TextIOWrapper'>`, and it is an iterable that yields the lines of the file, in order, as strings.

In [None]:
with open('files/novartis_microsoft.txt', 'r') as file:
    print("Type of `file`:", type(file), "\n")
    for line in file:
        print(type(line))
        print(line)

Every line in a text file ends with a new line character `\n` (the very last line may be an exception): this is how we know when a new line starts! However, if you want to avoid printing an extra new line every time you display a line, you can use the string method `strip`, which removes all whitespace, including `\n`, from both ends of the string.

In [None]:
with open('files/novartis_microsoft.txt', 'r') as file:
    for line in file:
        print(line.strip(), end = " ")
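Since `strip` removes every whitespace character at both ends, it also deletes indentation. A small sketch of the difference between `strip` and `rstrip("\n")`, using a made-up string instead of a line from the file:

```python
line = "    indented text  \n"  # made-up line, not taken from the file

# strip() removes all whitespace at both ends, including the indentation.
stripped = line.strip()

# rstrip("\n") removes only new line characters at the right end.
only_newline_removed = line.rstrip("\n")

# repr shows the quotes, so the remaining spaces are visible.
print(repr(stripped))
print(repr(only_newline_removed))
```

If the spaces at the beginning of a line matter, `rstrip("\n")` is probably the safer choice.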

If, instead of iterating through the lines of the file, you want to get access to all of them at once, you can read all the lines into a variable by using the `readlines` method: it creates a list of strings, where every string is a separate line of the file.

In [None]:
with open('files/novartis_microsoft.txt', 'r') as file:
    lines = file.readlines()
    print(type(lines))
    print(lines)

Another way to avoid overt iteration and to get the lines one by one is to read them into memory one after another by using the `readline` method.

In [None]:
with open('files/novartis_microsoft.txt', 'r') as file:
    line = file.readline()
    print(line)
    line = file.readline()
    print(line)

Notice that every time you execute `readline`, it moves to the next line of the file. To go back, we need to use the `seek` method, which jumps to the indicated byte of the file; therefore `seek(0)` will move us back to the very beginning of the file.

In [None]:
with open('files/novartis_microsoft.txt', 'r') as file:
    line = file.readline()
    print(line)
    line = file.readline()
    print(line)
    file.seek(0)
    line = file.readline()
    print(line)

If you are using the `with open(filepath, mode)` syntax, the file stays open only while the indented code is being executed. As soon as the code within the `with` block has finished, the file is closed automatically: the variable `file` still exists, but any attempt to read from it raises an error.

In [None]:
with open('files/novartis_microsoft.txt', 'r') as file:
    line = file.readline().strip()
    print(line)
    
# this line raises a ValueError: the file was closed at the end of the with block
line = file.readline()
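What happens at the end of the `with` block can be inspected through the `closed` attribute of the file object. This is a minimal sketch that assumes a throwaway file `demo.txt` in a temporary folder (both made up for the illustration) and catches the error instead of letting it stop the cell.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "demo.txt")  # hypothetical file name
    with open(path, "w") as f:
        f.write("only line\n")

    with open(path, "r") as file:
        inside = file.closed   # False: the file is open inside the block
    outside = file.closed      # True: leaving the block closed the file

    try:
        file.readline()        # reading from a closed file is an error
        error_message = None
    except ValueError as error:
        error_message = str(error)

print(inside, outside, error_message)
```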

Another way to open the file and keep it in memory _until explicitly closed_ is to assign the result of `open(filepath, mode)` to a variable. Then, after the file has been processed, it needs to be closed using the `close` method.

    file = open(filepath, mode)
    # code 
    file.close()
    
**Warning:** if the file is open in the `w` mode, i.e. if the file is being created, failure to close the file may result in losing some or all of the information that we intended to write: written data can stay in a memory buffer until the file is closed. In other modes, leaving a file open can cause problems as well.
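One possible way to make sure that `close` is always called, even when the processing code fails halfway, is a `try`/`finally` block. The sketch below is only an illustration under the assumption of a throwaway file `demo.txt` in a temporary folder; the `with` syntax shown earlier achieves the same thing with less typing.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "demo.txt")  # hypothetical file name

    file = open(path, "w")
    try:
        file.write("safe and sound\n")
    finally:
        file.close()  # runs even if the code in the try block raises an error

    closed_after = file.closed

    # Reopen the file to check that the text really reached the disk.
    check = open(path, "r")
    saved = check.read()
    check.close()

print(closed_after, repr(saved))
```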

In [None]:
file = open('files/novartis_microsoft.txt', 'r')
lines = file.readlines()
print(lines[0])
file.close()

Even though the file is closed, the variable `lines` is still available: `readlines` loaded all the lines from the file into `lines` before we closed the file.

In [None]:
print(lines[2].strip())

## Writing files

As mentioned before, the mode `w` opens a file for writing, i.e. creates the file. Recall the two methods we used in the reading mode:

* `readline` reads a line and returns a _string_ containing that line;
* `readlines` reads all lines and returns a _list of strings_.

In the writing mode, there are methods that write a single string or several strings in a similar manner:

* `write` takes a _string_ as its argument and writes it to the newly created file (note that there is no method called `writeline`);
* `writelines` takes a _list of strings_ as its argument and writes all of them to the newly created file.
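A short sketch of the two writing methods side by side. It assumes that the single-string method is the standard `write`, and it uses a throwaway file `demo.txt` in a temporary folder (both made up for the illustration). The strings are written exactly as given, one right after another.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "demo.txt")  # hypothetical file name

    with open(path, "w") as file:
        file.write("one")                 # a single string
        file.writelines(["two", "three"])  # a list of strings

    with open(path, "r") as file:
        content = file.read()

print(content)
```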

In [None]:
file = open('files/newfile.txt', 'w')
text_to_write = ["Hello world!", "It is Wednesday.", "Middle of the week!"]
file.writelines(text_to_write)
file.close()

**Practice.** As you can see, the strings of `text_to_write` are concatenated with each other: `writelines` does not add a _separator_, so we need to take care of it ourselves. Fix the code above so that every sentence in the file `newfile.txt` starts on a new line.

**Warning:** `writelines` can only write lists of strings. If the data that needs to be written contains other data types, make sure to convert them to strings!

In [None]:
file = open('files/newfile.txt', 'w')
text_to_write = ["Hello world!", 42]
file.writelines(text_to_write)  # raises a TypeError: 42 is not a string
file.close()

The usual `str` function takes care of converting nearly any datatype to its string representation.

In [None]:
file = open('files/newfile.txt', 'w')
text_to_write = ["Hello world! ", str(42)]
file.writelines(text_to_write)
file.close()
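When a list mixes several data types, converting the items one by one quickly becomes tedious. A possible shortcut, sketched here with made-up values and a throwaway file `demo.txt` in a temporary folder, is to apply `str` to every item in a list comprehension before writing.

```python
import os
import tempfile

data = ["Points:", 79, 82.5, True]  # made-up values of different types

# Convert every item to a string and add a space as a separator.
as_strings = [str(item) + " " for item in data]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "demo.txt")  # hypothetical file name

    with open(path, "w") as file:
        file.writelines(as_strings)

    with open(path, "r") as file:
        content = file.read()

print(content)
```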

## Working with CSV files

A very simple way to store tables in files is to use `.csv` files. CSV stands for "comma-separated values", and indeed, these files look like this:

    Name,Last Name,Department,Points
    Matt,Bellamy,AMS,79
    Dominic,Howard,LIN,82
    Chris,Wolstenholme,CSE,72
    
It is in fact possible to engineer a way to work with csv files using the same methods we already discussed.

In [None]:
with open('files/grades.csv', 'r') as file:
    for line in file:
        print(line.strip())

Every line of the file is still a string, and therefore to represent them as a list of values, we will need to split them.

In [None]:
with open('files/grades.csv', 'r') as file:
    for line in file:
        print(line.strip().split(","))

A simpler way to read csv files in Python is to use the `csv` or `pandas` packages.

### Working with csv through `csv` package

In [None]:
import csv

In order to read a csv file using the `csv` package, right after opening the file, we need to define a `csv.reader` for it. It will parse the rows automatically!

In [None]:
with open('files/grades.csv', 'r') as file:
    csvreader = csv.reader(file)
    for row in csvreader:
        print(row)
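One reason to prefer `csv.reader` over splitting on commas by hand is that a value may itself contain a comma, in which case it is wrapped in quotes. The sketch below uses a made-up row (it is not taken from `grades.csv`) and `io.StringIO`, which lets a string behave like an open file.

```python
import csv
import io

# Made-up data: the last name contains a comma, so it is wrapped in quotes.
raw = 'Name,Last Name,Department,Points\nMatt,"Bellamy, Jr.",AMS,79\n'

# Splitting by hand cuts the quoted value in two.
naive_rows = [line.strip().split(",") for line in io.StringIO(raw)]

# csv.reader understands the quotes and keeps the value together.
parsed_rows = list(csv.reader(io.StringIO(raw)))

print(naive_rows[1])
print(parsed_rows[1])
```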

Similarly, to write files, we want to define a `csv.writer` and change the mode to `w`. Then we will be able to write the rows of the csv one by one by applying the `writerow` method to the `csv.writer` object.

In [None]:
with open('files/greetings.csv', 'w', newline='') as file:  # newline='' is recommended by the csv documentation
    csvwriter = csv.writer(file)
    csvwriter.writerow(["hello", "hi", "howdy"])
    csvwriter.writerow(["zdravstvujte", "privet", "hej"])
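To check that the rows were stored as intended, the file can be read back with `csv.reader`. This is a minimal round-trip sketch that writes to a temporary folder instead of `files` (an assumption made to keep the example self-contained) and uses `writerows`, which writes a whole list of rows at once.

```python
import csv
import os
import tempfile

rows = [["hello", "hi", "howdy"], ["zdravstvujte", "privet", "hej"]]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "greetings.csv")  # temporary copy

    # newline='' lets the csv module handle line endings itself.
    with open(path, "w", newline="") as file:
        csv.writer(file).writerows(rows)

    with open(path, "r", newline="") as file:
        read_back = list(csv.reader(file))

print(read_back)
```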

You can read more about the functionality of the `csv` package [here](https://docs.python.org/3/library/csv.html).

However, we frequently want to extract the values from a particular _column_, and this might be slightly more tricky than extracting a row.

### Working with csv through `pandas` package

`pandas` is a package that has a wide variety of uses, and one of them is the ease of extracting a column from a csv file. Here, we see a new way to import a package:

    import pandas as pd

It means that you are importing `pandas`, but instead of the full name, you are going to refer to the package as `pd`.

In [None]:
import pandas as pd

We can then use `pd.read_csv(filepath)` to import the csv file, and the columns can then be referred to simply by their names! Read more on data processing with pandas [here](https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-6fcd0170be9c).

In [None]:
grades = pd.read_csv('files/grades.csv')
grades["Name"]

# Homework 6

**Due on Thursday, October 10th, 11.59pm**

Send your notebook (don't forget to save your solutions!) to <alena.aksenova@stonybrook.edu> with the subject **\[CompLing1\] Homework 6**.

**Problem 1.** Based on the csv `files/grades.csv`, create a dictionary that will have students' names as keys, and their grades as values.

**Problem 2.** Use the file `files/grades.csv` in order to extract the values from the column "Department" without using the `pandas` package.