# Processing text files
In this lesson we're going to prepare a simple text file with some short, simple content.

We're going to show you some basic techniques you can use to `read the file contents` in order to process them.

The processing will be very simple - you're going to copy the file's contents to the console, and count all the characters the program has read in.

But remember - our understanding of a text file is very strict. In our sense, it's a plain text file - it may contain only text, without any additional decorations (formatting, different fonts, etc.).

That's why you should avoid creating the file using any advanced text processor like MS Word, LibreOffice Writer, or similar tools. Use the very basics your OS offers: Notepad, vim, gedit, etc.

If your text files contain some national characters not covered by the standard ASCII charset, you may need an additional step. Your `open()` function invocation may require an argument denoting specific text encoding.

For example, if you're using a Unix/Linux OS configured to use UTF-8 as a system-wide setting, the `open()` function may look as follows:
```py
stream = open('file.txt', 'rt', encoding='utf-8')
```

where the encoding argument has to be set to a value which is a string representing proper text encoding (UTF-8, here).

Consult your OS documentation to find an encoding name adequate to your environment.
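To see why the encoding matters, here's a minimal sketch that writes some text with national characters to a temporary file and reads it back with the same `encoding` argument. The sample text, the file name `national.txt`, and the use of a temporary directory are assumptions made only for this illustration.

```python
import os
import tempfile

# Hypothetical sample text containing characters outside the ASCII charset.
sample = "Zażółć gęślą jaźń\n"

# Write the sample to a temporary file, stating the encoding explicitly.
path = os.path.join(tempfile.mkdtemp(), "national.txt")
stream = open(path, "wt", encoding="utf-8")
stream.write(sample)
stream.close()

# Read it back using the same encoding - the text comes back unchanged.
stream = open(path, "rt", encoding="utf-8")
content = stream.read()
stream.close()

print(content == sample)

# In UTF-8, a non-ASCII character takes more than one byte,
# so the encoded size is larger than the number of characters.
print(len(sample), len(sample.encode("utf-8")))
```

If you read the file with a different encoding than the one used to write it, you may get garbled characters or a `UnicodeDecodeError`.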


### Note
For the purposes of our experiments with file processing carried out in this section, we're going to use a pre-uploaded set of files (e.g., the tzop.txt or text.txt files) which you'll be able to work with. If you'd like to work with your own files locally on your machine, we strongly encourage you to do so, and to use IDLE (or any other IDE that you may prefer) to carry out your own tests.

In [3]:
# Opening tzop.txt in read mode, returning it as a file object.
# The path is a raw string (r"..."): in a regular string literal, the "\t" in
# "\tzop.txt" is read as a tab character, and open() fails with
# OSError: [Errno 22] Invalid argument.
stream = open(r"C:\dev\programming-docs\Python\Essentials-2\Modul-4\tzop.txt", "rt", encoding="utf-8")

print(stream.read())  # printing the content of the file
stream.close()
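A note on backslashes in Windows paths: in a regular string literal, a backslash followed by `t` is read as a single tab character, which is what leads to an `Invalid argument` error for a path ending in `\tzop.txt`. The sketch below shows the difference; the short relative path `data\tzop.txt` is a hypothetical one, used only for illustration.

```python
# A shortened, hypothetical path used only to illustrate the escape issue.
regular = "data\tzop.txt"    # "\t" becomes one tab character
raw = r"data\tzop.txt"       # raw string: the backslash is kept as it is
forward = "data/tzop.txt"    # forward slashes are accepted by open() on Windows, too

print(repr(regular), len(regular))
print(repr(raw), len(raw))
print("\t" in regular, "\t" in raw)
```

Using a raw string or forward slashes are two common ways to avoid the problem; doubling each backslash (`"data\\tzop.txt"`) works as well.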

# Processing text files: continued
Reading a text file's contents can be performed using several different methods - none of them is any better or worse than any other. It's up to you which of them you prefer.

Some of them will sometimes be handier, and sometimes more troublesome. Be flexible. Don't be afraid to change your preferences.

The most basic of these methods is the one offered by the `read()` method, which you were able to see in action in the previous lesson.

If applied to a text file, the method is able to:

  - read a desired number of characters (including just one) from the file, and return them as a string;
  - read all the file contents, and return them as a string;
  - if there is nothing more to read (the virtual reading head reaches the end of the file), the method returns an empty string.
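The three behaviors listed above can be sketched in a few lines. The temporary file and its one-line content are assumptions made just for this demonstration.

```python
import os
import tempfile

# Create a small sample file in a temporary directory (hypothetical content).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
s = open(path, "wt")
s.write("Simple is better than complex.")
s.close()

s = open(path, "rt")
first = s.read(1)       # exactly one character
next_five = s.read(5)   # the next five characters
rest = s.read()         # everything that is left in the file
at_end = s.read()       # nothing more to read: an empty string
s.close()

print(repr(first))
print(repr(next_five))
print(repr(rest))
print(repr(at_end))
```

Note that each invocation continues from the place where the previous one stopped.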

We'll start with the simplest variant and use a file named `text.txt`. The file has the following contents:
```s
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
```
text.txt

Now look at the code below, and let's analyze it.

In [4]:
from os import strerror

try:
    cnt = 0
    s = open('text.txt', "rt")
    ch = s.read(1)
    while ch != '':
        print(ch, end='')
        cnt += 1
        ch = s.read(1)
    s.close()
    print("\n\nCharacters in file:", cnt)
except IOError as e:
    print("I/O error occurred: ", strerror(e.errno))

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.

Characters in file: 131


The routine is rather simple:

  - use the try-except mechanism and open the file of the predetermined name (`text.txt` in our case);
  - try to read the very first character from the file (`ch = s.read(1)`);
  - if you succeed (this is proven by a positive result of the `while` condition check), output the character (note the `end=` argument - it's important! You don't want to skip to a new line after every character!);
  - update the counter (`cnt`), too;
  - try to read the next character, and the process repeats.

# Processing text files: continued
If you're absolutely sure that the file's length is safe and you can read the whole file into memory at once, you can do it - the `read()` method, invoked without any arguments or with an argument that evaluates to `None`, will do the job for you.

Remember - `reading a terabyte-long file using this method may exhaust your computer's memory`, which can crash your program or make the whole system unresponsive.

Don't expect miracles - computer memory isn't stretchable.
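If you're not sure the file fits in memory, one possible middle ground between reading one character at a time and reading everything at once is to read the file in fixed-size chunks. This is only a sketch: the helper name `count_characters`, the chunk sizes, and the generated test file are our assumptions, not part of the lesson's code.

```python
import os
import tempfile

def count_characters(path, chunk_size=1024):
    """Count characters, reading at most chunk_size characters at a time."""
    cnt = 0
    s = open(path, "rt")
    chunk = s.read(chunk_size)
    while chunk != "":           # an empty string means end of file
        cnt += len(chunk)
        chunk = s.read(chunk_size)
    s.close()
    return cnt

# Build a hypothetical test file: the same 31-character line, 1000 times.
path = os.path.join(tempfile.mkdtemp(), "big.txt")
s = open(path, "wt")
s.write("Beautiful is better than ugly.\n" * 1000)
s.close()

total = count_characters(path, chunk_size=64)
print("Characters in file:", total)
```

Only one chunk is held in memory at a time, so memory usage stays bounded regardless of the file's size.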

Look at the code below. What do you think of it?

In [5]:
from os import strerror

try:
    cnt = 0
    s = open('text.txt', "rt")
    content = s.read()
    for ch in content:
        print(ch, end='')
        cnt += 1
    s.close()
    print("\n\nCharacters in file:", cnt)
except IOError as e:
    print("I/O error occurred: ", strerror(e.errno))

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.

Characters in file: 131


Let's analyze it:

  - open the file as previously;
  - read its contents with a single `read()` invocation;
  - next, process the text, iterating through it with a regular `for` loop, and updating the counter value at each turn of the loop.

The result will be exactly the same as previously.
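Once the whole text is in memory as a string, the character count is also available directly from `len()`. Here's a minimal sketch that compares both ways of counting; it recreates the four-line sample in a temporary file, which is an assumption made so that the example runs on its own.

```python
import os
import tempfile

# Recreate the four-line sample (no newline after the last line).
text = ("Beautiful is better than ugly.\n"
        "Explicit is better than implicit.\n"
        "Simple is better than complex.\n"
        "Complex is better than complicated.")
path = os.path.join(tempfile.mkdtemp(), "text.txt")
s = open(path, "wt")
s.write(text)
s.close()

s = open(path, "rt")
content = s.read()
s.close()

# Counting in a loop, as in the lesson's code...
cnt = 0
for ch in content:
    cnt += 1

# ...gives the same number as the string's length.
print(cnt, len(content))
```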

# Processing text files: readline()
If you want to treat the file's contents `as a set of lines`, not a bunch of characters, the `readline()` method will help you with that.

The method tries to `read a complete line of text from the file`, and returns it as a string in the case of success. Otherwise, it returns an empty string.

This opens up new opportunities - now you can also count lines easily, not only characters.

Let's make use of it. Look at the code below.

In [6]:
from os import strerror

try:
    ccnt = lcnt = 0
    s = open('text.txt', 'rt')
    line = s.readline()
    while line != '':
        lcnt += 1
        for ch in line:
            print(ch, end='')
            ccnt += 1
        line = s.readline()
    s.close()
    print("\n\nCharacters in file:", ccnt)
    print("Lines in file:     ", lcnt)
except IOError as e:
    print("I/O error occurred:", strerror(e.errno))

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.

Characters in file: 131
Lines in file:      4


As you can see, the general idea is exactly the same as in both previous examples.
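To see exactly what `readline()` returns, here's a small sketch using a two-line temporary file (an assumption made for this demonstration). Note that the newline character is part of the returned line, and that an empty string signals the end of the file.

```python
import os
import tempfile

# A hypothetical two-line file; the last line has no trailing newline.
path = os.path.join(tempfile.mkdtemp(), "two_lines.txt")
s = open(path, "wt")
s.write("Simple is better than complex.\nComplex is better than complicated.")
s.close()

s = open(path, "rt")
first = s.readline()    # first line, including its '\n'
second = s.readline()   # last line, without '\n' in this file
third = s.readline()    # nothing left: an empty string
s.close()

print(repr(first))
print(repr(second))
print(repr(third))
```

An empty line inside the file is returned as `'\n'`, not as an empty string, so the `while line != ''` condition doesn't stop too early.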


# Processing text files: readlines()
Another method, which treats a text file as a set of lines, not characters, is `readlines()`.

The `readlines()` method, when invoked without arguments, tries to `read all the file contents, and returns a list of strings, one element per file line`.

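If you're not sure if the file size is small enough and don't want to test the OS, you can pass a size limit to the `readlines()` method: it stops reading further lines once the total size of the lines read so far exceeds the specified number. The method always returns complete lines, so a single invocation may read somewhat more than the limit (the returned value remains of the same kind - it's a list of strings).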

Feel free to experiment with the following example code to understand how the `readlines()` method works:

In [7]:
s = open("text.txt")
print(s.readlines(20))
print(s.readlines(20))
print(s.readlines(20))
print(s.readlines(20))
s.close()


['Beautiful is better than ugly.\n']
['Explicit is better than implicit.\n']
['Simple is better than complex.\n']
['Complex is better than complicated.']


The maximum accepted input buffer size is passed to the method as its argument.

You may expect that `readlines()` can process a file's contents more efficiently than `readline()`, as it may need to be invoked fewer times.

Note: when there is nothing to read from the file, the method returns an empty list. Use it to detect the end of the file.
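Here's a minimal sketch of both facts: invoked without arguments, `readlines()` returns all the lines at once, and a second invocation at the end of the file returns an empty list. The temporary file recreating the four-line sample is an assumption made so that the example is self-contained.

```python
import os
import tempfile

# Recreate the four-line sample (no newline after the last line).
path = os.path.join(tempfile.mkdtemp(), "text.txt")
s = open(path, "wt")
s.write("Beautiful is better than ugly.\n"
        "Explicit is better than implicit.\n"
        "Simple is better than complex.\n"
        "Complex is better than complicated.")
s.close()

s = open(path, "rt")
all_lines = s.readlines()   # the whole file, one list element per line
after_end = s.readlines()   # nothing left to read: an empty list
s.close()

print(len(all_lines))
print(all_lines[0] == "Beautiful is better than ugly.\n")
print(after_end)
```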

As for the buffer's size, you can expect that increasing it may improve input performance, but there is no golden rule for it - try to find the optimal values yourself.

Look at the code below. We've modified it to show you how to use `readlines()`.

In [8]:
from os import strerror

try:
    ccnt = lcnt = 0
    s = open('text.txt', 'rt')
    lines = s.readlines(20)
    while len(lines) != 0:
        for line in lines:
            lcnt += 1
            for ch in line:
                print(ch, end='')
                ccnt += 1
        lines = s.readlines(20)
    s.close()
    print("\n\nCharacters in file:", ccnt)
    print("Lines in file:     ", lcnt)
except IOError as e:
    print("I/O error occurred:", strerror(e.errno))

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.

Characters in file: 131
Lines in file:      4


We've decided to use a 20-byte-long buffer. Don't think it's a recommendation.

We've used such a value to avoid the situation in which the first `readlines()` invocation consumes the whole file.

We want the method to be forced to work harder, and to demonstrate its capabilities.

Inside the `while` loop, which keeps invoking `readlines()` until it returns an empty list, there are `two nested loops`: the outer one iterates through the list returned by `readlines()`, while the inner one prints each line character by character.
