**Reading text files**

**Special characters and the ASCII table**

Before we get into the subject of reading text files, we need to talk about strings with special characters.

The standard ASCII character set consists of 128 characters whose codes (expressed in decimal, hexadecimal, and octal) are shown in this table:

http://www.asciitable.com/

Note that some of the characters, such as _space_, _line feed_, and _tab_, are what we call _whitespace_ characters.


The table also lists 128 additional codes, referred to as _extended ASCII_. The term *extended ASCII* is outdated; today additional characters are handled by Unicode encodings, something we will discuss later.

When we talk about *text* files, we'll mean files whose characters are all among the standard ASCII characters.

To represent any one of these 256 codes (128 standard plus 128 extended) in a file, we need 8 bits, that is, a single byte. Your computer stores data in files as a series of bytes.
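As a small illustration of character codes, the sketch below uses Python's built-in `ord`, `chr`, and `str.encode` to show one character's code in decimal, hexadecimal, and octal, and to check that a standard ASCII character occupies a single byte. The sample character is an arbitrary choice.

```python
# Look up the ASCII code of a character and show it in three bases.
ch = "A"
code = ord(ch)                      # integer code of the character
print(code, hex(code), oct(code))   # decimal, hexadecimal, octal

# chr goes the other way: from a code back to the character.
print(chr(code))

# Encoding a standard ASCII character produces exactly one byte.
encoded = ch.encode("ascii")
print(len(encoded))
```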

When we create a string, some special ASCII characters (e.g. tab, newline, backslash) are written as escape sequences that begin with a backslash.

In [1]:
text="this is \tsome text\nandsome more\\text"
print(text)

this is 	some text
andsome more\text


**Reading from a text file**

In Python, as in most programming languages, to read from a text file we open it, read from it, and close it when we're done.

We can

- read the entire file as a text string
- read the file line by line
- read the file character by character

In Jupyter, we can create a text file to play around with.

When opening a file, if a full path is not given, the assumption is that the file is in the current working directory.

We can 

- determine what the current working directory is
- change the current working directory, and
- list what's in the current working directory

In [2]:
import os
folder=os.getcwd()
print(folder)

E:\OneDrive - Johns Hopkins\CurrentCourses\553.488.Fall.2022\JupyterNotebooks


In [3]:
os.chdir("E:\\OneDrive - Johns Hopkins\\CurrentCourses\\553.488.Fall.2022\\JupyterNotebooks\\ReadyForRecording")
os.listdir()

['.ipynb_checkpoints',
 'bad.txt',
 'binfile',
 'CreateBadFile.ipynb',
 'mytext.txt',
 'ReadingTextFiles.ipynb',
 'text.txt']

To open a file for reading, we need to specify an identifier (fileid) for the file to be opened. This fileid can be

- the filename if it is in the current working directory, or
- a relative path to the file, or
- an absolute path to the file

The **open** function also takes a *mode* argument.
The default mode is "rt", meaning that the file is opened as

- read only
- text (later in the course we'll talk about binary files)


**Filename only**

We open the file and assign the resulting *file handle* to a variable (**fin** is used below).

We then read the entire file into a string **text** and close the file.

In [4]:
fileid="text.txt"
fin=open(fileid,"rt") # the "rt" here is superfluous being the default
text=fin.read()
print(text)
fin.close()

Here is a line of text.
Here is another.
This is a last line.


**Relative path**

In the example below, a file (text2.txt) resides in the parent folder of the current working directory and we give a *relative* path to that file.

In [5]:
fileid="../text2.txt"
fin=open(fileid)
text=fin.read()
print(text)
fin.close()

Here is a line of text.
Here is another.
This is a last line.


**Absolute path**

Here we specify the full path to the file.

In [6]:
fileid="E:\\OneDrive - Johns Hopkins\\CurrentCourses\\"
fileid+="553.488.Fall.2022\\JupyterNotebooks\\text2.txt"
fin=open(fileid)
text=fin.read()
print(text)
fin.close()

Here is a line of text.
Here is another.
This is a last line.


**Why double backslashes?**

Above, we used double backslashes so that each pair puts a single literal backslash in the string.

If we don't do that, a backslash followed by certain characters is read as an escape sequence, so the string no longer contains the path we intended, and opening the file can fail. For example,

\\" means a literal double quotation mark

\t means a tab character

In [7]:
fileid="E:\OneDrive - Johns Hopkins\CurrentCourses\\"
fileid+="553.488.Fall.2022\JupyterNotebooks\text.txt"
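To see the effect without touching any real file, the sketch below compares three spellings of a made-up Windows path (the path is hypothetical and is never opened). It also shows a raw string, written with an `r` prefix, as an alternative to doubling every backslash.

```python
# A hypothetical path; nothing is opened, we only inspect the strings.
plain = "C:\temp\notes.txt"       # \t and \n are turned into tab and newline
doubled = "C:\\temp\\notes.txt"   # each \\ becomes one literal backslash
raw = r"C:\temp\notes.txt"        # raw string: backslashes are kept as typed

# The plain version is two characters shorter and contains control characters.
print(len(plain), len(doubled), len(raw))
print("\t" in plain, "\n" in plain)

# The doubled and raw spellings give the same string.
print(doubled == raw)
```

One caveat with raw strings: a raw string literal cannot end in a single backslash, so a folder path with a trailing backslash still needs the doubled form.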

**Files containing non-ASCII characters**

If a file contains characters that are part of the extended ASCII set, we typically get an error if we try to read it in as a text file.

The file "bad.txt" contains all of the characters corresponding to bytes decimal 128 through 255 in the ASCII table. 

We can inspect the file using a hex editor and here we see the bytes of the file in hexadecimal (base 16) notation, where the digits are

    0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F

so that A=10, B=11, C=12, D=13, E=14 and F=15.

The byte range for those extended ASCII characters is hexadecimal 80 through FF.

For example, the character whose hexadecimal representation is EA has decimal representation

$$ 14 \times 16 + 10 = 234 $$
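The conversion can be checked in Python. This short sketch uses the built-in `int` with base 16 and compares it with the digit-by-digit calculation (E is 14, A is 10).

```python
# Convert the hexadecimal string "EA" to a decimal integer.
value = int("EA", 16)
print(value)

# The same number computed digit by digit: E is 14, A is 10.
by_hand = 14 * 16 + 10
print(by_hand)

# Going back from decimal to hexadecimal.
print(format(value, "X"))
```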

In any case, when we try to read this file we get an error. 

We'll talk later about how to properly read the bytes of such a file when we discuss *binary files*.

In [8]:
fin=open("bad.txt")
text=fin.read()
fin.close()
print(text)

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 47: character maps to <undefined>
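The error depends on which encoding Python uses to turn bytes into characters. The sketch below is only an illustration under assumptions: it builds its own throwaway file (the name `high_bytes.txt` is made up, standing in for bad.txt), shows that decoding as ASCII fails, and then reads the same bytes with the Latin-1 encoding, which assigns a character to every byte value. Whether Latin-1 is the *right* encoding for a given file depends on how that file was produced.

```python
import os
import tempfile

# Build a small throwaway file holding the bytes 0x80..0xFF.
path = os.path.join(tempfile.mkdtemp(), "high_bytes.txt")
with open(path, "wb") as fout:
    fout.write(bytes(range(128, 256)))

# Decoding as ASCII fails because every byte is outside 0..127.
try:
    with open(path, encoding="ascii") as fin:
        fin.read()
    failed = False
except UnicodeDecodeError:
    failed = True

# Latin-1 maps each byte 0..255 to exactly one character,
# so this read does not raise a decoding error.
with open(path, encoding="latin-1") as fin:
    text = fin.read()

print(failed, len(text))
```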

**Reading one line at a time**

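The **readline** method reads one line at a time. It returns an empty string once the end of the file is reached, which is how the loop below knows when to stop.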


In [None]:
fin=open("text.txt")
while True:
    s=fin.readline()
    if s:
        print(s)
    else:
        break
fin.close()

**Removing the trailing newline character**

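Observe that each line read from the file ends with a newline character (except possibly the last one), so **print** leaves a blank line after it.

We can use **strip** to remove it. Note that **strip** removes all leading and trailing whitespace, not just the newline.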

In [None]:
text="this is a test\n"
text

In [None]:
text.strip()
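If the leading or trailing spaces of a line matter, `strip` may remove too much. As a hedged comparison, the sketch below contrasts `strip()` with `rstrip("\n")`, which removes only trailing newline characters, and with `rstrip()`, which removes all trailing whitespace. The sample string is made up.

```python
line = "   indented text  \n"

stripped = line.strip()          # removes whitespace on both ends
no_newline = line.rstrip("\n")   # removes only trailing newline characters
right_only = line.rstrip()       # removes all trailing whitespace

print(repr(stripped))
print(repr(no_newline))
print(repr(right_only))
```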

**Cleaner method**

In [None]:
fin=open("text.txt")
while True:
    s=fin.readline()
    if s:
        print(s.strip())
    else:
        break
fin.close()

**File handle as iterator**

Here is a better way to loop over lines in a file.

In the following example, we make a list of strings rather than printing them out.

In [None]:
fin=open('text.txt')
lines=[]
for x in fin:
    lines.append(x.strip())
fin.close()
print(lines)

**with**

The preferred way to proceed uses the *with* statement. This automatically closes the file once the code block under it has finished executing.

In the example below, we test at the end whether the file is closed.

In [None]:
L=[]
with open('text.txt') as fin:
    for line in fin:
        L.append(line)
print(L)
if fin.closed:
    print("file is closed")
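One practical benefit of *with* is that the file is closed even if an error occurs inside the block. The sketch below simulates such an error; the temporary file and the `RuntimeError` are stand-ins chosen for illustration.

```python
import os
import tempfile

# Create a small throwaway file to read from.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as fout:
    fout.write("one\ntwo\n")

try:
    with open(path) as fin:
        first = fin.readline()
        raise RuntimeError("simulated failure while reading")
except RuntimeError:
    pass

# The with statement closed the file despite the exception.
print(first.strip(), fin.closed)
```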

**Enumerate**

We can create an enumerator from an iterable: **enumerate** returns an iterator that yields 2-tuples, each pairing a running index (starting from 0) with the next item.

In [None]:
L=["apple","orange","banana","tangerine"]
E=enumerate(L)
print(next(E))
print(next(E))
print(next(E))
print(next(E))

**Retaining line numbers**

When we create a list of lines we might wish to track the line numbers.

Enumerate enables us to do this.

In [9]:
with open("text.txt") as fin:
    E=enumerate(fin)
    for e in E:
        print(e)

(0, 'Here is a line of text.\n')
(1, 'Here is another.\n')
(2, 'This is a last line.')
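Line numbers are usually counted from 1 rather than 0. A minimal sketch, assuming a throwaway copy of the three-line file (the temporary path is made up): `enumerate` accepts a `start` argument, and `rstrip("\n")` drops the trailing newline from each line.

```python
import os
import tempfile

# Recreate the three-line sample in a temporary location.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as fout:
    fout.write("Here is a line of text.\nHere is another.\nThis is a last line.")

numbered = []
with open(path) as fin:
    # start=1 makes the counter begin at 1 instead of 0.
    for number, line in enumerate(fin, start=1):
        numbered.append((number, line.rstrip("\n")))

for item in numbered:
    print(item)
```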


The enumerator won't work outside of the *with* block: it reads from the file lazily, and by then the file has been closed.

In [10]:
with open("text.txt") as fin:
    E=enumerate(fin)
for e in E:
    print(e)

ValueError: I/O operation on closed file.

But we can create a list inside the block.

In [11]:
with open("text.txt") as fin:
    L=list(enumerate(fin))
for e in L:
    print(e)

(0, 'Here is a line of text.\n')
(1, 'Here is another.\n')
(2, 'This is a last line.')


**File pointers**

When we open a file, a file *pointer* marks the position where the next read or write takes place; for a file opened for reading, it starts at the beginning of the file.

We can control this position and read a specified number of characters.

This gives us the capability of reading characters from a file one by one.

In [12]:
fin=open("text.txt")
fin.seek(4)
txt=fin.read(3)
print(txt)

fin.seek(12)
txt=fin.read(1)
print(txt)

fin.seek(0)
txt=fin.read(4)
print(txt)

fin.close()

 is
n
Here
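Alongside `seek`, file handles have a `tell` method that reports the current pointer position. A small sketch on a made-up temporary file: we remember a position with `tell`, read on, and then use `seek` to return to the remembered spot. In text mode the value returned by `tell` should be treated as a bookmark to hand back to `seek` rather than as a character count.

```python
import os
import tempfile

# Throwaway file with a known content.
path = os.path.join(tempfile.mkdtemp(), "letters.txt")
with open(path, "w") as fout:
    fout.write("ABCDEFGHIJ")

with open(path) as fin:
    first = fin.read(3)   # pointer moves past "ABC"
    pos = fin.tell()      # remember where we are
    rest = fin.read()     # read to the end of the file
    fin.seek(pos)         # jump back to the remembered position
    again = fin.read()    # reads the same text a second time

print(first, rest, again)
```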


In the following, we read one character at a time until we obtain an empty string, which signals the end of the file.

In [13]:
fin=open("text.txt")
ctr=0
L=[]
while True:
    x=fin.read(1)
    if not x: # or we could use if x=="" here
        break
    L.append(x)
    ctr+=1
fin.close() # close the file once we are done reading
print(ctr)

61


In [14]:
L

['H',
 'e',
 'r',
 'e',
 ' ',
 'i',
 's',
 ' ',
 'a',
 ' ',
 'l',
 'i',
 'n',
 'e',
 ' ',
 'o',
 'f',
 ' ',
 't',
 'e',
 'x',
 't',
 '.',
 '\n',
 'H',
 'e',
 'r',
 'e',
 ' ',
 'i',
 's',
 ' ',
 'a',
 'n',
 'o',
 't',
 'h',
 'e',
 'r',
 '.',
 '\n',
 'T',
 'h',
 'i',
 's',
 ' ',
 'i',
 's',
 ' ',
 'a',
 ' ',
 'l',
 'a',
 's',
 't',
 ' ',
 'l',
 'i',
 'n',
 'e',
 '.']

**Writing to a text file**

Writing to a file involves opening a file for writing, writing a string to the file, then closing it.

As mentioned earlier, when we open a file and give only the fileid, it opens in read-only text mode by default; this is the same as using "rt" as the mode.

- r stands for read
- t stands for text

In order to write to a file we must specify "w" in the mode. Since the default file type is text, we don't need to write the mode as "wt", but we could.

The same rules apply as before in specifying a fileid. The specification can be

- a file name if the file is to be written in the current working directory
- a relative path specification
- an absolute/full path specification

In the following we write a string to a file called "output.txt" in the current working directory.

When such a file does not exist it gets created.

In [1]:
with open("output.txt","w") as fout:
    text="This is some text.\nAnd some more."
    fout.write(text)

To test this we read the contents of the file and print it.

In [2]:
with open("output.txt") as fin:
    text=fin.read()
print(text)

This is some text.
And some more.
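Note that **write** does not add a newline on its own; the string is written exactly as given. As a sketch (the file name and the list of lines are made up), the loop below writes several lines by appending `"\n"` to each, and also shows that `write` returns the number of characters it was given.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "lines.txt")
lines = ["first", "second", "third"]

with open(path, "w") as fout:
    for line in lines:
        # write returns the number of characters written
        count = fout.write(line + "\n")

with open(path) as fin:
    text = fin.read()

print(repr(text))
print(count)   # count from the last call, i.e. for "third\n"
```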


**Opening for writing empties the file**

Be careful! If we open an existing file in "w" mode, its current contents are destroyed: the file becomes empty as soon as it is opened.

In [3]:
with open("output.txt","w") as fout:
    pass
with open("output.txt") as fin:
    text=fin.read()
print(text)
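If accidentally overwriting a file is a concern, one option is the "x" mode of **open**, which creates a new file and raises `FileExistsError` if the file already exists. The sketch below uses a made-up temporary file to show both outcomes.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "keep.txt")

# The file does not exist yet, so "x" creates it.
with open(path, "x") as fout:
    fout.write("precious data")

# A second attempt with "x" is refused instead of emptying the file.
try:
    with open(path, "x") as fout:
        fout.write("overwritten")
    refused = False
except FileExistsError:
    refused = True

with open(path) as fin:
    text = fin.read()

print(refused, text)
```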




**Append**

If we want to *append* to a file, i.e. add more text without removing what's already there, we use the "a" specification.

If the file doesn't exist, it gets created.

In [4]:
with open("output.txt","w") as fout:
    fout.write("Let there be light ... ")

with open("output.txt","a") as fout:
    fout.write("and there was light.")

with open("output.txt","r") as fin:
    text=fin.read()
    print(text)

Let there be light ... and there was light.


**Write can only write a string**

The following produces an error.

In [5]:
with open("output.txt","a") as fout:
    pi=3.1415926535
    fout.write(pi)

TypeError: write() argument must be str, not float

**Stringify**

We have to stringify an object in order to write it to a file.

In [6]:
with open("output.txt","a") as fout:
    pi=3.1415926535
    fout.write(str(pi))
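Besides `str`, formatted string literals (f-strings) give control over how a number is written, for example the number of decimal places. A brief sketch with made-up values and a temporary file:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "numbers.txt")
pi = 3.1415926535
n = 42

with open(path, "w") as fout:
    fout.write(f"pi is about {pi:.4f}\n")   # four digits after the decimal point
    fout.write(f"n = {n}\n")                # integers are converted the same way

with open(path) as fin:
    text = fin.read()

print(text)
```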

**Multiple file handles for the same file**

Python lets us have two different file handles for the same file, but this should be avoided unless you know what you are doing.

You can have two different file handles for reading, each with its own file pointer.

But if one handle is modifying the file while another is reading or also modifying it, results can be unpredictable, and you should definitely avoid this.


In [7]:
with open("output.txt","w") as fout:
    fout.write("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
fin1=open("output.txt")
fin2=open("output.txt")
fin1.seek(13)
fin2.seek(5)
L=[]
for i in range(10):
    L.append(fin1.read(1))
    L.append(fin2.read(1))
print(L)

['N', 'F', 'O', 'G', 'P', 'H', 'Q', 'I', 'R', 'J', 'S', 'K', 'T', 'L', 'U', 'M', 'V', 'N', 'W', 'O']


In [8]:
fout=open("output.txt","w")
fin=open("output.txt","r")

fout.write("This is a test")
text=fin.read()
print(text)

fout.write("Another test")
text=fin.read()
print(text)

fout.close()
fin.close()
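One reason for the surprising behavior is buffering: text passed to **write** is typically held in memory and only reaches the file when the buffer fills up, or when the handle is flushed or closed. The sketch below, on a made-up temporary file, reads through a second handle before and after calling `flush`. It is meant to illustrate the mechanism, not to recommend mixing handles this way.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "shared.txt")

fout = open(path, "w")
fin = open(path)

fout.write("This is a test")
before = fin.read()   # typically empty: the text is still in fout's buffer

fout.flush()          # push the buffered text out to the file
fin.seek(0)           # go back to the start before reading again
after = fin.read()

fout.close()
fin.close()

print(repr(before), repr(after))
```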



