<center> <img src="https://github.ccs.neu.edu/caglar/DS3000/blob/master/img/ds3000.png?raw=true"> </center>

<center> <h1> Week 4 - Day 1 </h1> </center>

<center> <h2> Part 1: Data Loading </h2></center>

## Outline
1. <a href='#1'>Files</a>
2. <a href='#2'>Working with Text Files: The `with` Statement</a>
3. <a href='#3'>Reading Data from Text Files</a>
4. <a href='#4'>Writing to a Text File</a>
5. <a href='#5'>Tokenizing Lines of a Text File</a>


##  Loading Data from Files
* Variables, lists, tuples, dictionaries, sets, arrays, pandas `Series` and pandas `DataFrame`s offer only _temporary_ data storage
    * lost when a local variable “goes out of scope” or when the program terminates
* Data maintained in **files** are persistent
* Several popular formats:
    * Plain text
    * CSV (Comma-Separated Values)
    * JSON (JavaScript Object Notation)    

<a id="1"></a>

## 1. Files
* A **text file** is a sequence of characters 
* A **binary file** (for images, videos and more) is a sequence of bytes
* First character in a text file or byte in a binary file is located at position 0
    * In a file of **_n_** characters or bytes, the highest position number is **_n_ – 1**
* For each file you **open**, Python creates a **file object** that you’ll use to interact with the file
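
To make the position numbering concrete, here is a small sketch that uses the file object's `tell` and `seek` methods on a binary file. The file name `positions_demo.bin` and the temporary folder are illustrative assumptions, not part of the course files. Binary mode is used because positions in text mode are not guaranteed to be simple character counts.

```python
import os
import tempfile

# Hypothetical demo file created in a temporary folder
demo_path = os.path.join(tempfile.mkdtemp(), "positions_demo.bin")

with open(demo_path, "wb") as f:
    f.write(b"HOGWARTS")          # n = 8 bytes, positions 0 through 7

with open(demo_path, "rb") as f:
    start = f.tell()              # current position before reading anything
    first = f.read(1)             # the byte at position 0
    f.seek(7)                     # jump to the highest position, n - 1
    last = f.read(1)              # the byte at position 7
    end = f.tell()                # after the last byte, the position equals n

print(start, first, last, end)
```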

<img src = "res/eof.png" />

### 1.1. End of File
* Every operating system provides a mechanism to denote the end of a file
    * Some use an **end-of-file marker**
    * Others maintain a count of the total characters or bytes in the file
    * Programming languages hide these operating-system details from you
    
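* Python signals the end of a file by returning an empty string (`''`) from `read()` and `readline()`; `"\n"` marks the end of a *line*, not the end of the file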
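
As a minimal sketch of how end of file shows up in Python code: once every line has been read, `readline` returns an empty string. The file name `eof_demo.txt` is a made-up example written to a temporary folder.

```python
import os
import tempfile

# Hypothetical demo file created in a temporary folder
demo_path = os.path.join(tempfile.mkdtemp(), "eof_demo.txt")

with open(demo_path, "w") as f:
    f.write("first line\nsecond line\n")

with open(demo_path, "r") as f:
    line1 = f.readline()      # includes the trailing newline
    line2 = f.readline()      # includes the trailing newline
    at_eof = f.readline()     # nothing left to read: empty string

print(repr(line1), repr(line2), repr(at_eof))
```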

### 1.2. Escape Sequences
| Escape sequence | Description
| :------- | :------------
| `\n` | Insert a newline character in a string. When the string is displayed, for each newline, move the screen cursor to the beginning of the next line. 
| `\t` | Insert a horizontal tab. When the string is displayed, for each tab, move the screen cursor to the next tab stop. 
| `\\` | Insert a backslash character in a string.
| `\"` | Insert a double quote character in a string.
| `\'` | Insert a single quote character in a string.
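
The escape sequences in the table can be tried out directly in string literals. The sample strings below are invented for illustration; note that each escape sequence is stored as a single character.

```python
# Tab separates the columns, newline starts the second row
text = "Name\tGrade\nHarry\t85"
print(text)

# Escaped double quotes and an escaped backslash inside a double-quoted string
quoted = "She said \"hi\" and opened C:\\temp"
print(quoted)

# Each escape sequence counts as one character
newline_length = len("\n")
tab_length = len("\t")
backslash_length = len("\\")
print(newline_length, tab_length, backslash_length)
```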

<a id="2"></a>

## 2. Working with Text Files: The `with` Statement 
* The `with` statement is used to work with files
* Acquires a resource and assigns its corresponding object to a variable
* Allows the application to use the resource via that variable
* Calls the resource object’s **`close` method** to release the resource when the suite finishes
* The built-in `open` function can open files for reading, writing, and/or appending
* The `mode` argument of `open` specifies the file-open mode
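
The sketch below illustrates the points above by checking the file object's `closed` attribute inside and after a `with` suite, then doing roughly the same work by hand with `try`/`finally`. The file name `with_demo.txt` and the variable names are assumptions made for this example.

```python
import os
import tempfile

# Hypothetical demo file created in a temporary folder
demo_path = os.path.join(tempfile.mkdtemp(), "with_demo.txt")

with open(demo_path, "w") as f:
    f.write("Expecto Patronum\n")
    open_inside = not f.closed   # the file is still open inside the suite

closed_after = f.closed          # the with statement has closed the file

# Roughly the same thing written by hand with try/finally
g = open(demo_path, "r")
try:
    contents = g.read()
finally:
    g.close()                    # runs even if read() raises an exception

print(open_inside, closed_after, repr(contents))
```

The `with` form is shorter and makes it hard to forget the `close` call.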

### 2.1. File-Open Modes
| Mode | Description
| ------ | :------
| `'r'` | Open a text file for reading. This is the default if you do not specify the file-open mode when you call `open`. The file must already exist.
| `'w'` | Open a text file for writing, creating the file if it does not exist. Existing file contents are _deleted_.
| **`'a'`** | Open a text file for appending, creating the file if it does not exist. New data is written at the end of the file.
| **`'r+'`** | Open a text file for reading and writing. The file must already exist.
| **`'w+'`** | Open a text file for reading and writing, creating the file if it does not exist. Existing file contents are _deleted_.
| **`'a+'`** | Open a text file for reading and appending, creating the file if it does not exist. New data is written at the end of the file.
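
A short sketch of how three of these modes behave in practice. The file name `modes_demo.txt` is hypothetical and lives in a temporary folder so that no course file is touched.

```python
import os
import tempfile

# Hypothetical demo file created in a temporary folder
demo_path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

with open(demo_path, "w") as f:      # 'w' creates the file
    f.write("one\n")

with open(demo_path, "w") as f:      # 'w' again: the old contents are discarded
    f.write("two\n")

with open(demo_path, "a") as f:      # 'a': new data goes after the existing data
    f.write("three\n")

with open(demo_path, "r") as f:
    final_contents = f.read()

# Opening a missing file in a reading mode raises FileNotFoundError
try:
    with open(demo_path + ".missing", "r") as f:
        pass
    missing_error = False
except FileNotFoundError:
    missing_error = True

print(repr(final_contents), missing_error)
```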

<a id="3"></a>

## 3. Reading Data from Text Files
* If the contents of a file should not be modified, open the file for reading only
    * Prevents program from accidentally modifying the file

In [None]:
with open('res/hp1.txt', mode='r') as hp1:
    for line in hp1:
        print(line)

* Built-in `open` function opens the file `hp1.txt`
* `with` statement assigns the object returned by `open` to the variable `hp1` in the **`as` clause**
* `with` statement’s suite uses `hp1` to interact with the file
* At the end of the `with` statement’s suite, the `with` statement *implicitly* calls the file object’s **`close`** method to close the file 

In [None]:
with open('res/hp1.txt', mode='r') as hp1:
    for line in hp1:
        print(line)

* `mode` argument specifies the **file-open mode**
    * Whether to open the file for reading, for writing, or for both
* Mode **`'r'`** opens the file for *reading*
* *Reading* modes raise a `FileNotFoundError` if the file does not exist
* Iterating through a file object reads one line at a time from the file and returns it as a string
* By convention, the **`.txt` file extension** indicates a plain text file
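
One detail worth seeing in code: each string produced by iterating over a file object keeps its trailing `"\n"`, so `print(line)` shows a blank line between lines. A minimal sketch, using a made-up file `lines_demo.txt`, that strips the newline with `rstrip`:

```python
import os
import tempfile

# Hypothetical demo file created in a temporary folder
demo_path = os.path.join(tempfile.mkdtemp(), "lines_demo.txt")
with open(demo_path, "w") as f:
    f.write("line one\nline two\n")

raw_lines = []
clean_lines = []
with open(demo_path, mode="r") as f:
    for line in f:
        raw_lines.append(line)                  # still ends with "\n"
        clean_lines.append(line.rstrip("\n"))   # newline removed

print(raw_lines)
print(clean_lines)
```

An alternative is `print(line, end="")`, which stops `print` from adding a second newline.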

### 3.1. `readlines()` Method

* File object’s **`readlines`** method also can be used to read an *entire* text file
* Returns each line as a string in a list of strings

In [None]:
with open('res/hp1.txt', mode='r') as hp1:
    lines = hp1.readlines()

In [None]:
# Look at the first 10 items in lines, a list with one string per line of the file
lines[:10]

* For small files, this works well, but iterating over the lines in a file object, as shown above, can be more efficient
    * Enables your program to process each text line as it’s read, rather than waiting to load the entire file
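
As an illustration of processing each line as it is read, the sketch below counts lines and words without ever holding the whole file in a list. The file name `count_demo.txt` and its contents are invented for the example.

```python
import os
import tempfile

# Hypothetical demo file created in a temporary folder
demo_path = os.path.join(tempfile.mkdtemp(), "count_demo.txt")
with open(demo_path, "w") as f:
    f.write("alpha beta gamma\ndelta epsilon\n")

line_count = 0
word_count = 0
with open(demo_path, "r") as f:
    for line in f:                       # only one line is in memory at a time
        line_count += 1
        word_count += len(line.split())

print(line_count, word_count)
```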

### 3.2. `read()` Method
* For a text file, returns a string containing up to the number of characters specified by the method’s integer argument
* For a binary file, returns up to the specified number of bytes
* If no argument is specified, the method returns all of the remaining contents of the file

In [61]:
with open('res/hp1.txt', mode='r') as hp1:
    book = hp1.read(5) #reads the first 5 characters

In [62]:
book

'Harry'
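
Successive calls to `read` continue from where the previous call stopped. The sketch below shows this on a tiny made-up file, `read_demo.txt`, rather than on `hp1.txt`.

```python
import os
import tempfile

# Hypothetical demo file created in a temporary folder
demo_path = os.path.join(tempfile.mkdtemp(), "read_demo.txt")
with open(demo_path, "w") as f:
    f.write("Harry Potter")

with open(demo_path, "r") as f:
    first_five = f.read(5)    # the first 5 characters
    next_one = f.read(1)      # the character that follows them
    rest = f.read()           # everything that remains
    after_end = f.read()      # nothing left: empty string

print(repr(first_five), repr(next_one), repr(rest), repr(after_end))
```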

<a id="4"></a>

## 4. Writing to a Text File
* Mode `'w'` opens the file for *writing*, creating the file if it does not exist
* The file object’s **`write` method** writes one record (a string) at a time to the file
* If you do not specify a path to the file, Python creates it in the current folder

In [None]:
with open("res/grade_book.txt", "w") as fgrade:
    fgrade.write("Harry 85 93 90")
    fgrade.write("Hermione 95 100 100")
    fgrade.write("Ron 86 89 91")

In [None]:
# Windows Users: View file contents
!more res\grade_book.txt

In [None]:
# macOS/Linux Users: View file contents
!cat res/grade_book.txt

### 4.1. Escape Characters in Files
* The `write()` method does not add a newline, so each call continues on the same line as the previous one.
* End each string with `"\n"` so that the next record starts on a new line.

In [65]:
with open("res/grade_book.txt", "w") as fgrade:
    fgrade.write("Harry 85 93 90\n")
    fgrade.write("Hermione 95 100 100\n")
    fgrade.write("Ron 86 89 91\n")

In [66]:
!more res\grade_book.txt

Harry 85 93 90
Hermione 95 100 100
Ron 86 89 91


### Be careful
* Opening a file for writing (mode = "w") **deletes** all the existing data in the file
* If you want to keep the current content and add new content to the end of the file, use the append mode, **"a"**

In [67]:
my_str = "Ginny 75 95 90"

with open("res/grade_book.txt", "a") as fgrade:
    fgrade.write(my_str + "\n")

In [None]:
!more res\grade_book.txt

### 4.2. `writelines()` Method
* Receives a list of strings and writes them to a file, without adding newlines between them

In [68]:
student_list = ["Neville 70 64 63", "\n", "Luna 79 85 91"]

with open("res/grade_book.txt", "a") as fgrade:
    fgrade.writelines(student_list)

In [69]:
!more res\grade_book.txt

Harry 85 93 90
Hermione 95 100 100
Ron 86 89 91
Ginny 75 95 90
Neville 70 64 63
Luna 79 85 91
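
Because `writelines` adds no line separators, a common pattern is to append `"\n"` to each string as it is written instead of placing `"\n"` items in the list. A minimal sketch, assuming a throwaway file named `writelines_demo.txt`:

```python
import os
import tempfile

# Hypothetical demo file created in a temporary folder
demo_path = os.path.join(tempfile.mkdtemp(), "writelines_demo.txt")
students = ["Neville 70 64 63", "Luna 79 85 91"]

with open(demo_path, "w") as f:
    # writelines accepts any iterable of strings; add the newline ourselves
    f.writelines(s + "\n" for s in students)

with open(demo_path, "r") as f:
    written = f.read()

print(repr(written))
```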


<a id="5"></a>

  
## 5. Tokenizing Lines of a Text File
* Lines can be tokenized with the string method **`split()`** as they are read from a file

In [1]:
with open('res/grade_book.txt', mode='r') as grades:
    for line in grades:
        print(line.split())

['Harry', '85', '93', '90']
['Hermione', '95', '100', '100']
['Ron', '86', '89', '91']
['Ginny', '75', '95', '90']
['Neville', '70', '64', '63']
['Luna', '79', '85', '91']


In [3]:
with open('res/grade_book.txt', mode='r') as grades:
    for line in grades:
        student_name, potion1, potion2, potion3 = line.split()
        student_ave = (int(potion1) + int(potion2) + int(potion3))/3
        print("Average for", student_name, "is", student_ave)

Average for Harry is 89.33333333333333
Average for Hermione is 98.33333333333333
Average for Ron is 88.66666666666667
Average for Ginny is 86.66666666666667
Average for Neville is 65.66666666666667
Average for Luna is 85.0


* For each line in the file, string method `split` returns tokens in the line as a list of strings
    * We unpack the list into the variables *student_name*, *potion1*, *potion2*, and *potion3*
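
Unpacking into exactly four variables fails on a blank line or on a student with a different number of grades. One possible variation, sketched here under the assumption that every non-blank line is a name followed by at least one integer grade, uses starred unpacking. The file `grades_demo.txt` and the `averages` dictionary are hypothetical.

```python
import os
import tempfile

# Hypothetical demo file: one blank line and a student with only two grades
demo_path = os.path.join(tempfile.mkdtemp(), "grades_demo.txt")
with open(demo_path, "w") as f:
    f.write("Harry 85 93 90\n\nLuna 79 85\n")

averages = {}
with open(demo_path, "r") as f:
    for line in f:
        tokens = line.split()
        if not tokens:                   # skip blank lines
            continue
        name, *scores = tokens           # first token is the name, the rest are grades
        scores = [int(s) for s in scores]
        averages[name] = sum(scores) / len(scores)

for name, average in averages.items():
    print("Average for", name, "is", average)
```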