# Reading and Writing Files

So far, we have typed all of our "data" into the code of our software (e.g. the names of the students, and their ages).

Most of the time, this kind of data is stored in files.  We need to read (and write) files so that we can create and use permanent copies of these data (and exchange these data with other people or software).

The function we will use to open a file is called (surprise!) "open".

open takes two arguments:
1. the path to the file
2. "how" to open the file (for read?  for write?  for "append"? for read and write?)

It looks like this:

    myfile = open("/path/to/file.csv", "r")  # opens file.csv for "read"
    
I have already created a file called students.csv that we can open now:

In [37]:
studentfile = open("students.csv", "r")
print(studentfile)

<_io.TextIOWrapper name='students.csv' mode='r' encoding='UTF-8'>



Does the output of that print statement surprise you?  What it tells you is that 'studentfile' is a Python "object" (again, we will discuss objects in more detail later, but you will start to see how they work now...)

studentfile is an object of type "TextIOWrapper" (from the "io" module, which is part of the Python standard library, so it is available in every Python installation).  It knows what its filename is, it knows that it is open for "read", and it also reports the "encoding" that will be used to decode the file.  Python does not inspect the file to work this out: because we did not ask for a specific encoding, it used the default for this system, which here is UTF-8 (a text encoding that allows extended text characters like the German umlauts, and Greek alpha, beta, etc.   This is a good default for us!)
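The pieces of information shown in that printout can also be asked for one at a time.  The sketch below is only an illustration: so that it runs on its own, it first recreates the lesson's data in a temporary folder (using the "w" mode that is explained later in this lesson), and the names *demo_path* and *setup* are made up for this example.  It also passes the optional *encoding* argument to open, which the lesson's own example leaves out.

```python
import os
import tempfile

# Recreate the lesson's students.csv in a temporary folder so this runs on its own.
# (demo_path and setup are names invented for this sketch.)
demo_path = os.path.join(tempfile.mkdtemp(), "students.csv")
setup = open(demo_path, "w", encoding="utf-8")
setup.write("Mark,50\nAlejandro,25\nJulia,26\nDenise,23\nJosef,21\n")
setup.close()

# Asking for the encoding explicitly means we do not depend on the system default.
studentfile = open(demo_path, "r", encoding="utf-8")

print(studentfile.name)      # the path we passed to open()
print(studentfile.mode)      # r
print(studentfile.encoding)  # utf-8
print(studentfile.closed)    # False, because the file is still open

studentfile.close()
print(studentfile.closed)    # True
```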

## Reading information from a file

Surprise!  The most basic method used to read information is.... 'read'!  This reads **the entire file**.

    print(studentfile.read())
    


In [2]:
print(studentfile.read())

Mark,50
Alejandro,25
Julia,26
Denise,23
Josef,21




Now we need to talk about a feature of file input/output, called a "pointer".  The pointer is the position where the code "is" in the file - is it at the beginning?  Is it at the end?  Is it at line 5?

Where is the pointer now?  Let's try the same command again!

In [4]:

print(studentfile.read())





Nothing at all!  That's because the pointer is at the end of the file - when we call studentfile.read() it tries to read starting from the end of the file...and of course, there is nothing there.  

To reset back to the beginning, we will use the "seek" method, and set the pointer to position '0':



In [5]:
studentfile.seek(0)
print(studentfile.read())



Mark,50
Alejandro,25
Julia,26
Denise,23
Josef,21
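If you want to watch the pointer move, file objects also have a *tell()* method that reports the current position.  The sketch below is a small illustration, not part of the original lesson: it recreates the data in a temporary folder so that it runs on its own (*demo_path*, *setup*, *first*, *second* and *third* are names made up for this example).

```python
import os
import tempfile

# Recreate the lesson's students.csv in a temporary folder so this runs on its own.
demo_path = os.path.join(tempfile.mkdtemp(), "students.csv")
setup = open(demo_path, "w", encoding="utf-8")
setup.write("Mark,50\nAlejandro,25\nJulia,26\nDenise,23\nJosef,21\n")
setup.close()

studentfile = open(demo_path, "r", encoding="utf-8")

print(studentfile.tell())     # 0: a freshly opened file starts at the beginning
first = studentfile.read()    # reads everything and leaves the pointer at the end
second = studentfile.read()   # nothing is left to read, so this is an empty string
print(repr(second))           # ''

studentfile.seek(0)           # move the pointer back to the start
print(studentfile.tell())     # 0
third = studentfile.read()    # the whole file again
print(first == third)         # True

studentfile.close()
```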




## More refined file access - line-by-line

Most of the time, you do not want to read the entire file into memory (tell me why this can be very very bad!.... please)

MOST of the time, a file will have one "record" per line.  e.g. our CSV file has the "name,age" for one student per line.  We want to read those lines one-at-a-time and do something useful with each record.

The method we want to use is called "readlines()"

    print(studentfile.readlines())

In [7]:
studentfile.seek(0)   # set it back to the beginning again for this lesson...

print(studentfile.readlines())

['Mark,50\n', 'Alejandro,25\n', 'Julia,26\n', 'Denise,23\n', 'Josef,21\n']



You will see that this returns a list, which means we can use it in a for loop...


In [17]:
studentfile.seek(0)   # set it back to the beginning again for this lesson...

for line in studentfile.readlines():
    print("the current record is", line)


the current record is Mark,50

the current record is Alejandro,25

the current record is Julia,26

the current record is Denise,23

the current record is Josef,21
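One thing to keep in mind: readlines() still builds a list containing every line of the file, so the whole file ends up in memory anyway.  For a very large file, you can loop over the file object itself, which hands you one line at a time.  The sketch below shows this alternative; it recreates the data in a temporary folder so that it runs on its own, and *demo_path*, *setup* and *records* are names made up for this example.

```python
import os
import tempfile

# Recreate the lesson's students.csv in a temporary folder so this runs on its own.
demo_path = os.path.join(tempfile.mkdtemp(), "students.csv")
setup = open(demo_path, "w", encoding="utf-8")
setup.write("Mark,50\nAlejandro,25\nJulia,26\nDenise,23\nJosef,21\n")
setup.close()

studentfile = open(demo_path, "r", encoding="utf-8")

records = []
for line in studentfile:      # the file object hands out one line per loop
    print("the current record is", line)
    records.append(line)      # kept only so we can count the lines afterwards

studentfile.close()
print(len(records))           # 5
```

The printed records look the same as with readlines(); the difference is only in how much of the file Python holds at once.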




We're getting closer to what we want!  We have each record as a string in the format "Mark,50".  What we want is to separate the "Mark" and the "50" so that we could put them into separate variables (e.g. *name* and *age*)

There is a ***correct*** way to do this, but you already know one way to solve this problem!  

In the box below, use regular expressions to capture the name and the age into the variables *name* and *age*

<p style="visibility: hidden;">
#!/usr/bin/python3
import re  # this brings Python's regular expression module into your program

studentfile.seek(0)   # set it back to the beginning again for this lesson...

for line in studentfile.readlines():
    #print("the current record is", line)
    matchObj = re.search(r'(\w+),(\d+)', line)  # CAPTURE the name (word characters) and the age (digits) on either side of the comma
    if matchObj:
        name = matchObj.group(1)
        age = matchObj.group(2)
        print("Name: ", name, "   Age: ", age)
    else:
        print("No match!!")
</p>

In [None]:
# put your amazing solution here!



OK, so now you have solved the problem using regular expressions, however... the solution isn't very "abstract".  In another case, you might have a more complex record:

    Mark,50,190cm,95kg,163483,113mmhg,29mg/ml

Your regular expression would start to get ugly!  What is the one thing that is constant in this CSV file?  (In fact, the name of the file-type tells you!)

In cases like this, there is a method called "split", which will take a string and split it based on whatever separator you give it.  In this case, the comma.

  
    for line in studentfile.readlines():
        print("the current record is", line)
        name, age = line.split(',')
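To see why split copes better with longer records, here is a small sketch using the more complex record from above.  When there are more than two values, it is easier to keep the whole list that split returns and pick out values by position; the variable names *record* and *fields* are made up for this example.

```python
record = "Mark,50,190cm,95kg,163483,113mmhg,29mg/ml"

fields = record.split(",")   # one list element per comma-separated value
print(fields)
print(len(fields))           # 7

name = fields[0]             # list indexing picks out individual values
age = fields[1]              # note: this is still the string '50', not a number
print("Name:", name, "    Age:", age)
```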
        

In [35]:
studentfile.seek(0)   # set it back to the beginning again for this lesson...

for line in studentfile.readlines():
    print("the current record is", line)
    name, age = line.split(',')
    print("Name:", name, "    Age:", age)


the current record is Mark,50

Name: Mark     Age: 50

the current record is Alejandro,25

Name: Alejandro     Age: 25

the current record is Julia,26

Name: Julia     Age: 26

the current record is Denise,23

Name: Denise     Age: 23

the current record is Josef,21

Name: Josef     Age: 21




Much better!  But... *Still not quite right!!*  What are all of those blank lines?  We didn't ask for blank lines...

Remember just a few minutes ago we looked at the output from readlines():

    studentfile.seek(0)   # set it back to the beginning again for this lesson...
    print(studentfile.readlines())
    
    ==>  ['Mark,50\n', 'Alejandro,25\n', 'Julia,26\n', 'Denise,23\n', 'Josef,21\n']
    
Those blank lines are because of the \n (newline) character at the end of every line.  What is happening is that the print statements above ACTUALLY look like this:

    the current record is Alejandro,25\n
    Name: Alejandro     Age: 25\n      <----- the value of the age variable after the split is '25\n'
    
Can we discard this newline?  Sure!   The method *rstrip()* will strip all whitespace (spaces, tabs and newlines) from the end (right-hand end --> **r**strip() ) of the line:



In [41]:
studentfile.seek(0)   # set it back to the beginning again for this lesson...

for line in studentfile.readlines():
    line = line.rstrip()
    print("the current record is", line)
    name, age = line.split(',')
    print("Name:", name, "    Age:", age)


the current record is Mark,50
Name: Mark     Age: 50
the current record is Alejandro,25
Name: Alejandro     Age: 25
the current record is Julia,26
Name: Julia     Age: 26
the current record is Denise,23
Name: Denise     Age: 23
the current record is Josef,21
Name: Josef     Age: 21
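If you want to see exactly what rstrip() removes, it can help to print the strings with *repr()*, which shows the \n instead of acting on it.  The sketch below also shows two related string methods that are not used in this lesson, *lstrip()* (left-hand end) and *strip()* (both ends); the variable names *line* and *padded* are made up for this example.

```python
line = "Alejandro,25\n"

print(repr(line))             # 'Alejandro,25\n'
print(repr(line.rstrip()))    # 'Alejandro,25'

padded = "  Julia,26  \n"
print(repr(padded.rstrip()))  # '  Julia,26'     (right-hand end only)
print(repr(padded.lstrip()))  # 'Julia,26  \n'   (left-hand end only)
print(repr(padded.strip()))   # 'Julia,26'       (both ends)
```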



When you have finished with an open file, it is a very good idea to close it!

    studentfile.close()   # it's a good idea to close a file once you are finished with it!  We are...
    

In [42]:
studentfile.close()   # it's a good idea to close a file once you are finished with it!  We are...
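Because it is easy to forget to call close(), Python also offers the *with* statement, which closes the file for you when the indented block ends.  This is an alternative to the open/close pattern used in this lesson, not a replacement for understanding it.  The sketch below recreates the data in a temporary folder so that it runs on its own; *demo_path* and *setup* are names made up for this example.

```python
import os
import tempfile

# Recreate the lesson's students.csv in a temporary folder so this runs on its own.
demo_path = os.path.join(tempfile.mkdtemp(), "students.csv")
setup = open(demo_path, "w", encoding="utf-8")
setup.write("Mark,50\nAlejandro,25\nJulia,26\nDenise,23\nJosef,21\n")
setup.close()

# The file is open only inside the indented block.
with open(demo_path, "r", encoding="utf-8") as studentfile:
    for line in studentfile.readlines():
        line = line.rstrip()
        name, age = line.split(",")
        print("Name:", name, "    Age:", age)

# No close() call was needed: leaving the block closed the file.
print(studentfile.closed)   # True
```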



## Writing to a file

Writing to a file is very straightforward.  Use the same "open" command that you have already learned, but using the "w" flag ("open for **w**rite"), then write information to that open file using the *write()* method.

Python will help you by creating the file if it doesn't exist.  For example, the box below will create a file named "OLDERstudents.csv" if that file doesn't exist.  ***IF IT DOES EXIST, IT WILL BE DESTROYED!!!!!  YOU CANNOT GET THE CONTENT BACK!!!!  BE CAREFUL!!!***

The file pointer is set to the beginning of the file.

Here is how easy it is:

In [47]:
olderstudents = open("OLDERstudents.csv", "w")
olderstudents.write("hello, I am writing stuff to a file!\nThis is very cool!")  # the write function, using \n (newline)
olderstudents.close()

checkcontent = open("OLDERstudents.csv", "r")
print(checkcontent.read())  # print the content of the file
checkcontent.close()

hello, I am writing stuff to a file!
This is very cool!
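Two details of the "w" mode are worth seeing for yourself: opening an existing file with "w" empties it straight away, and write() does not add a newline unless you put \n in the string.  The sketch below demonstrates both in a temporary folder, so it will not touch any file of yours; *demo_path*, *first*, *second*, *check* and *content* are names made up for this example.

```python
import os
import tempfile

# Work in a temporary folder so no real file is overwritten by this sketch.
demo_path = os.path.join(tempfile.mkdtemp(), "OLDERstudents.csv")

first = open(demo_path, "w", encoding="utf-8")
first.write("this is the original content\n")
first.close()

# Opening the same file with "w" again empties it before anything is written.
second = open(demo_path, "w", encoding="utf-8")
second.write("line one")      # write() does not add a newline for you
second.write("line two\n")    # so this continues on the same line
second.close()

check = open(demo_path, "r", encoding="utf-8")
content = check.read()
check.close()
print(repr(content))          # 'line oneline two\n'
```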


## Now you!

* create the file OLDERstudents.csv
* using the data from the original students.csv, make everyone 5 years older
* write the new older student data to the OLDERstudents.csv file, in an identical format (Mark,55....)
* do this again, but this time, create a "header line" (Student Name, Student Age)
        Student Name, Student Age
        Mark,55
        Alejandro,30
        ...
        ...
       
* do this again, but instead of creating a CSV (comma-separated value) file, create a TSV (tab-separated value) file
  * call it ***OLDERstudents.tsv***
  * You need to know: the symbol for TAB is \t
  * these are the two most common structured text-file formats
  * both of these can be imported into software like MS Excel

In [None]:
# put your amazing code here!


## Append

The final mode for writing to a file is "append", which means, "open the file, and put the pointer at the END of the file".  This allows you to open a file and add new information to it, without destroying the existing information.

The append flag is 'a'

    olderstudents = open("OLDERstudents.csv", "a")   # open for append
    
We won't go through an example here, but... you might have a question about this on your exam, so try it yourself! :-)