# SLU06 - String & File Handling

In this notebook we will be covering the following:

### File Handling
- Read and write files in Python  

### String manipulation
- String concatenation
- lower/upper
- replace
- strip, lstrip, rstrip
- index characters (strings indexing)
- join
- split
- substring
- loop through strings

### Testing, Exception raising & Handling
- Common errors
- Assert statement
- Raise and handle errors

## 1) File Handling

Handling files is a very important skill in Python.

Developers build all kinds of programs and data scientists create all kinds of models, and almost all of them first need to read some data and later save the results.

Imagine that you wanted to check how many files on your computer contain the word 'Puppy'. I'm sure you wouldn't want to do it by hand! Opening hundreds or thousands of files, counting words... Boring! Let's make the computer do it!

![img](markdown/do.jpeg)

There are multiple ways to read and save files, and they usually depend on the file format you're dealing with.

Images, text data, JSON files, CSV files: there are many, many formats.

In this lesson we're going to learn how to handle files using built-in Python functions.

Let's start!

### 1.1) Open Files

To open a file in Python we can use the built-in **open()** function. 
This function takes 2 arguments as input: 
- file path
- mode (optional parameter; by default mode='r')

The path of a file is its address on the device where the file is stored. The directory is the address of the folder where the file is.   

Think of this analogy:

The `device` is the world, and the `path` is the address of the `file` in this world. The `directory` is the address of the `house` where the `file` is stored.

A __path__ of a file can be relative or absolute.

An __absolute path__ begins with the root folder.    

A __relative path__ is relative to the program’s current working directory.

<img src="markdown/path.png" width="900" height="300" >

For file `document.txt`, on image above:

__Directory__ is `C:\home\ds-prep-course-workspace\week 3\SLU06 - String and Files Handling`;

__Path__ is:   

- __absolute path__  is `C:\home\ds-prep-course-workspace\week 3\SLU06 - String and Files Handling\document.txt`;    
- __relative path__ is `document.txt`;
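
As a small sketch of the difference, the standard `pathlib` module can build an absolute path from a relative one. The file name `document.txt` comes from the example above; the variable names are only illustrative, and the printed working directory depends on where you run the code.

```python
from pathlib import Path

relative_path = Path("document.txt")         # relative to the current working directory
absolute_path = Path.cwd() / relative_path   # working directory + relative path

print(Path.cwd())                    # the current working directory (differs per machine)
print(relative_path.is_absolute())   # False
print(absolute_path.is_absolute())   # True
print(absolute_path.name)            # document.txt
```

Both paths point to the same file as long as the program keeps running from the same working directory.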

The __mode__ indicates what we are allowed to do with the file, such as reading it or writing to it. 
This parameter is optional, but setting it explicitly protects the file from operations that we don't want to be executed.

The __modes__ are: 
- `'r'` – Read mode, used when the file is only being read (the file must already exist) 
- `'w'` – Write mode, used to write new information to the file (if a file with the same name already exists, its content is erased as soon as it is opened in this mode) 
- `'a'` – Append mode, used to add new data to the end of the file; new information is automatically appended after the existing content 
- `'r+'` – Special read and write mode, used to handle both actions when working with a file
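
To make the difference between the modes concrete, here is a minimal sketch. It works inside a temporary folder so that no real file is touched; the file name `modes_demo.txt` is made up for this illustration, and the `write()`, `read()` and `close()` calls are explained later in this notebook.

```python
import os
import tempfile

# Temporary folder, removed automatically at the end (an assumption of this sketch)
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "modes_demo.txt")  # hypothetical file name

    f = open(path, "w")   # 'w' creates the file
    f.write("first\n")
    f.close()

    f = open(path, "w")   # opening with 'w' again erases "first"
    f.write("second\n")
    f.close()

    f = open(path, "a")   # 'a' keeps the content and adds to the end
    f.write("third\n")
    f.close()

    f = open(path, "r")   # 'r' only allows reading
    content = f.read()
    f.close()

print(content)
```

The printed content holds only `second` and `third`: the second `'w'` wiped the first line, while `'a'` preserved what was already there.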



Using a relative path, let's open the file `document.txt`, which is located in the same folder as this SLU, in read mode.

In [1]:
relative_path = "document.txt"
f = open(relative_path, 'r')
f

<_io.TextIOWrapper name='document.txt' mode='r' encoding='UTF-8'>

Actually, **open()** doesn't read the file's content. As the name suggests, it only opens the file and returns a file object, which we keep in a variable. Now that we have the file object, we can use its methods (a method is a special type of function; you'll learn about them next week, so no need to worry about it now) to read the content of the file or to write to it. Let's see how we can do that!

### 1.2) Read Files

There are a few ways to read a file's content. Let's see all of them.

#### 1.2.1) Read the whole content of a file

We're going to read the whole file's content and store it in a variable called *text*:

In [2]:
text = f.read()

In [3]:
text

'maria\nate an apple\npie for breakfast'

Great! We can see it now. All the lines are stored in one variable delimited by *\n* symbols (*\n* means the new line).

It doesn't look pretty until we print it:

In [4]:
print(text)

maria
ate an apple
pie for breakfast


Now it looks much better. As we can see, `print()` turned each *\n* symbol into an actual line break, so the text was printed on 3 separate lines (the same way as in the original file). 

You can open this file in a text editor if you don't trust me!

![image.png](markdown/trust.jpeg)

#### 1.2.2) Read Line By Line

With the `read()` method we store the whole file in just one string. 

It is also possible to read each line separately using `readlines()`. `readlines()` reads all the lines in the file and returns a list in which each element is one line. We can then loop through this list and access each line. Let's see how it's done!

In [9]:
lines = f.readlines()
print(lines)

[]


Wait, but why is it empty?


<img src="markdown/wait.jpg" width="300" height="150" >


Let's think about what we did.

We first opened the file and stored the file object in a variable called *f*. Once the file was opened, a read cursor was placed at the very beginning of the file (the same way as if we had opened it in a text editor).
Next, we called the **read()** method and stored the result in a variable called *text*. **read()** went through the entire file and left the read cursor at the end of the file (with nothing more to read).

So the cursor was already at the end of the file when we tried to read the lines separately, and `readlines()` had nothing left to return.
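
The cursor behaviour can be observed directly with the `tell()` method, which reports the current position. This is only a sketch: it uses `io.StringIO`, an in-memory stand-in for an opened text file, so that it runs without `document.txt`.

```python
import io

# io.StringIO behaves like an opened text file (assumption made to keep the sketch self-contained)
f = io.StringIO("maria\nate an apple\npie for breakfast")

start = f.tell()      # position right after opening
first = f.read()      # reads everything; the cursor is now at the end
second = f.read()     # nothing is left to read

print(start)          # 0
print(repr(second))   # ''
print(f.readlines())  # []
```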

Now let's move the cursor back and do the same thing once again using method **[seek()](https://www.tutorialspoint.com/python/file_seek.htm)**.

In [6]:
f.seek(0) # moves the cursor to the very beginning of the file
lines = f.readlines()
print(lines)

['maria\n', 'ate an apple\n', 'pie for breakfast']


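As you can see, the output of `readlines()` is a list in which each element is a line. Notice that every line except the last one still ends with *\n*.

Let's iterate over the list and print its elements. Since `print()` adds its own line break, we will see an empty line after each of them: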

In [7]:
for line in lines:
    print(line)

maria

ate an apple

pie for breakfast


We can access a single line simply by indexing the list:

In [8]:
print(lines[0])

maria



#### 1.2.3) Read a single line

Actually, we don't have to read the whole file if we don't want to.

Sometimes a file is large, and we don't want to keep all of it in memory at once. Our PC has a limited amount of memory, so we can't just load terabytes of data into it. This is where reading one line at a time comes in!

Let's read just the first line of the file (don't forget to move the cursor back to the beginning first):

In [12]:
f.seek(0)
f.readline()

'maria\n'

And the second line:

In [10]:
f.readline()

'ate an apple\n'

As you can see, every time we execute the readline() method, the cursor moves to the next line. 

Because of that, the output of readline() is different every time we call it.
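
A file object can also be used directly in a `for` loop, which reads one line per iteration without loading the whole file first. The sketch below again uses `io.StringIO` as a stand-in for `open('document.txt')`, so the content is an assumption that mirrors the file used in this notebook.

```python
import io

f = io.StringIO("maria\nate an apple\npie for breakfast")  # stand-in for an opened file

line_count = 0
for line in f:          # each iteration reads only the next line
    line_count += 1

print(line_count)  # 3
```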

So, we already know how to open a file, read the whole file or read a single line.

### 1.3) Close

It's important to mention that, the same way as we usually close programs after using them on a PC, we should also close files in Python:

In [11]:
f.close()

### 1.4) 'with open()' statement

If we don't want to close the file by hand each time we use it, we can use the 'with open()' statement. 

In [12]:
with open('document.txt') as f: # open the file and store it in 'f'
    lines = f.readlines() # read all the lines
    for line in lines: # iterate over each separate line
        print(line) # print this line

maria

ate an apple

pie for breakfast


As soon as the Python interpreter leaves the indented block of the "with" statement, the file is closed automatically.
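
We can check this with the `closed` attribute of the file object. A minimal sketch, using a temporary folder and a made-up file name so that nothing on your disk is changed:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "with_demo.txt")  # hypothetical file name
    with open(path, "w") as f:
        f.write("hello")
        inside = f.closed    # the file is still open inside the block
    outside = f.closed       # leaving the block closed the file

print(inside, outside)  # False True
```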

### 1.5) Writing to files

The last thing to mention is that we can also write information to files.

What we need to do is:
- Open the file with permission to write (by default, the **open()** function opens the file with read-only permission)
- Write lines to the file
- Close the file

In [13]:
# open a new file with write permissions (mode='w')
with open('new_document.txt', mode='w') as f:
    f.write("I just learned how to deal with files in python")

Now find this file locally! It's going to be stored in the same directory where this Jupyter notebook is stored.

And now that we have written to the file, let's check what is inside using what we learned so far.

In [14]:
with open('new_document.txt') as f:
    print(f.readlines())

['I just learned how to deal with files in python']
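
One detail worth knowing: `write()` does not add a line break on its own, so we have to include `\n` ourselves whenever we want a new line. A small sketch, with a temporary folder and a hypothetical file name:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "lines_demo.txt")  # hypothetical file name
    with open(path, mode="w") as f:
        f.write("first line\n")   # explicit line break
        f.write("second ")
        f.write("line")           # continues on the same line
    with open(path) as f:
        written_lines = f.readlines()

print(written_lines)  # ['first line\n', 'second line']
```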


## 2) Strings Manipulation

We already learned about different types of variables in Python (int, string, float etc.).

Now it's time to take a deeper look at strings.

As you already know, strings represent text in programming languages.

Programmers and data scientists work with strings all the time. Look around, almost everything has text!
More than that, there is a big field in Data Science called Natural Language Processing, in which the main thing we work with is ... text!

Text is so good that mathematicians decided to add some letters to formulas just because it's beautiful, you see? 

So let's learn how to deal with it!

!['img'](markdown/text.jpeg)

In [15]:
# open the file, read all its lines and let 'with' close it for us
with open('document.txt') as f:
    lines = f.readlines()
sentence_1 = lines[0]
print(type(sentence_1)) # print the type of the variable
sentence_1

<class 'str'>


'maria\n'

### 2.1) strip, rstrip and lstrip Methods

As you can see, sentence_1 is a string *(<class 'str'>)* representing the first line of *document.txt*.

It ends with a *\n* symbol, so let's remove it.

We can use the **strip()** method for that.

**strip()** removes any whitespace (spaces, tabs and line breaks, so both " " and "\n" symbols) from the beginning and the end of a string.

There are also **lstrip()**, which removes whitespace only from the left side of the string, and **rstrip()**, which removes whitespace only from the right side of the string.

In [16]:
s = '\n\n   word   \n\n'
s.strip()

'word'

In [17]:
s.rstrip()

'\n\n   word'

In [18]:
s.lstrip()

'word   \n\n'

We can also tell strip() which characters to remove. So if we want to keep the spaces but remove the "\n" symbols, we can do it the following way:

In [19]:
s.strip('\n')

'   word   '
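
Two details are easy to miss here: strip() only touches the two ends of the string, and its argument is treated as a set of characters rather than as one exact substring. The strings below are made up for this sketch.

```python
only_ends = '--hello-world--'.strip('-')   # the dash in the middle is kept
several = 'abcHELLOcba'.strip('abc')       # 'abc' means "any of a, b or c"
inner = '  a b  '.strip()                  # the space between a and b is kept

print(only_ends)  # hello-world
print(several)    # HELLO
print(inner)      # a b
```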

### 2.2) Strings concatenation

Now let's read all the lines from the document, concatenate them (combine them in one string) and remove all the \n symbols:

In [20]:
lines

['maria\n', 'ate an apple\n', 'pie for breakfast']

In [21]:
# concatenate strings using + and remove whitespaces using strip() function
sentence = lines[0].strip() + lines[1].strip() + lines[2].strip() 
sentence

'mariaate an applepie for breakfast'

Looks like we also need to add spaces between lines. 

Let's do it:

In [22]:
# concatenate strings with spaces (' ')
sentence = lines[0].strip() + ' ' + lines[1].strip() + ' ' + lines[2].strip() 
sentence

'maria ate an apple pie for breakfast'

### 2.3) lower and upper methods

Now let's learn about **lower()** and **upper()** methods.

Python allows us to easily change the whole line to upper or lowercase if we want to:

In [23]:
sentence = sentence.upper()
print(sentence)

MARIA ATE AN APPLE PIE FOR BREAKFAST


![capital](markdown/capital.jpeg)

In [24]:
sentence = sentence.lower()
print(sentence)

maria ate an apple pie for breakfast


These methods can be really useful in some cases. 

For example, when trying to understand whether a movie review is positive or negative. 

It's often a good idea to convert all the words to lowercase, because it makes no difference to us whether a user wrote 'bad', 'BAD' or 'Bad': these 3 strings are different for the machine, but they have the same meaning for us.
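
A quick sketch of this idea, using a made-up list of review words:

```python
reviews = ['bad', 'BAD', 'Bad']      # hypothetical words taken from reviews

print(reviews[0] == reviews[1])      # False: for Python these are different strings

normalized = [word.lower() for word in reviews]
all_same = normalized[0] == normalized[1] == normalized[2]

print(normalized)  # ['bad', 'bad', 'bad']
print(all_same)    # True
```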

### 2.4) Replace method and substrings

We'd like to be polite and write Maria's name with a capital letter. 

One way to do it is with the **replace()** method. It replaces every occurrence of a *substring* A with another substring B. 

P.S. A substring is a string which is a part of another string. E.g., 'ma' is a substring of 'maria'.

Let's first replace all 'm' characters with 'M'.

In [25]:
sentence.replace('m', 'M')

'Maria ate an apple pie for breakfast'

Remember that this method doesn't modify the original string. It only returns a copy of the original string with the replacements applied.
If we want to overwrite the original string, we can simply write:
`sentence = sentence.replace('m', 'M')`

Now let's replace some words in the sentence above so that we get the sentence "Anna ate a big apple pie", and save the result in a separate variable:

In [26]:
new_sentence = sentence.replace('maria', 'Anna')
new_sentence = new_sentence.replace('an', 'a big')
new_sentence = new_sentence.replace(' for breakfast', '')
print(new_sentence)

Anna ate a big apple pie


And now let's replace 'a' with 'the' in this sentence:

In [27]:
new_sentence.replace('a', 'the')

'Annthe thete the big thepple pie'

Oops, something went wrong. 

Remember I told you this method replaces every occurrence of a substring? That's what it did: it replaced every 'a' with 'the', including the ones inside other words. Be careful with it.
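
One possible way around this, sketched here under the assumption that words are separated by spaces, is to compare whole words instead of substrings. It relies on `split()` and `join()`, which are covered later in this notebook. The optional third argument of `replace()`, which limits the number of replacements, is shown as well.

```python
new_sentence = 'Anna ate a big apple pie'

words = new_sentence.split()                                       # list of words
fixed_words = ['the' if word == 'a' else word for word in words]   # touch only the word 'a'
fixed_sentence = ' '.join(fixed_words)

print(fixed_sentence)  # Anna ate the big apple pie

# replace() accepts a maximum number of replacements as a third argument
first_only = 'banana'.replace('a', 'o', 1)
print(first_only)      # bonana
```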

### 2.5) Strings indexing

The same way as with lists, we can index strings using this notation: string[index].
It's a simple way to get just one character of a string.

But first let's remind ourselves about Python lists. 

Do you remember how we accessed the first line in our list of lines earlier in this notebook? `sentence_1 = lines[0]`

In a sense, a string is similar to a list of characters. We can index it the same way as we do with lists:

In [28]:
# print the whole string
print(sentence)
# print the first character of the string (index 0)
print(sentence[0])

maria ate an apple pie for breakfast
m


### 2.6) Strings slicing

The same way as with lists, we can slice strings. 

Let's print the first word of the sentence. `maria` occupies indexes 0 to 4, and since the end index of a slice is not included, we slice from 0 to 5:

In [29]:
sentence[0:5]

'maria'
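
Slices also work with omitted bounds and with negative indexes, which count from the end of the string. A short sketch on the same sentence (the variable names are only illustrative):

```python
sentence = 'maria ate an apple pie for breakfast'

first_word = sentence[:5]     # start omitted: from the beginning up to index 4
second_word = sentence[6:9]   # characters at indexes 6, 7 and 8
last_word = sentence[-9:]     # end omitted: the last 9 characters
last_char = sentence[-1]      # negative indexes count from the end

print(first_word)   # maria
print(second_word)  # ate
print(last_word)    # breakfast
print(last_char)    # t
```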

If now we combine our knowledge of strings concatenating, indexing and slicing, we can build a whole new string from the original one.

In [30]:
sentence[3].upper() + ' like ' + sentence[10:18]

'I like an apple'

Let's use the same technique to replace the first letter in the original sentence: 
> maria -> Maria.

In [31]:
# apply upper() to the first letter
# concatenate it with the rest of the sentence (starting from second letter)
sentence = sentence[0].upper() + sentence[1:] 
print(sentence)

Maria ate an apple pie for breakfast


### 2.7) join method

There is another way to combine strings. 

Let's use **join()** for that. 

The **join()** method takes all items in an iterable and joins them into one string, separating each element by some symbol.

Let's first try to use spaces to separate the words from the list in the output string.

In [32]:
' '.join(['maria', 'ate', 'an', 'apple', 'pie', 'for', 'breakfast'])

'maria ate an apple pie for breakfast'

And now with '_' to separate words.

In [33]:
'_'.join(['maria', 'ate', 'an', 'apple', 'pie', 'for', 'breakfast'])

'maria_ate_an_apple_pie_for_breakfast'
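
Keep in mind that join() only accepts strings: joining a list that contains numbers raises a TypeError. A common workaround, sketched here with a made-up list, is to convert every element with `str()` first:

```python
numbers = [1, 2, 3]   # hypothetical list of integers

# ', '.join(numbers) would raise a TypeError, so we convert each number to a string
joined = ', '.join(str(n) for n in numbers)

print(joined)  # 1, 2, 3
```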

### 2.8) Split method

The last method that is really useful when we work with strings is `split()`.

`split()` does the opposite of the `join()` method. 

`join()` merges multiple strings into one, separating them with a given string (a space, for example), whereas `split()` divides a string into a list of strings using a separator. By default, the separator is any whitespace (spaces, tabs, line breaks). It means that any words separated by whitespace will appear as separate elements of the list. 

Let's call `split()` on the last line of our text ('pie for breakfast') to see what happens:

In [34]:
# original line
lines[2]

'pie for breakfast'

In [35]:
# split the line
lines[2].split()

['pie', 'for', 'breakfast']
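
split() also accepts an explicit separator, and it then behaves slightly differently from the default. The strings in this sketch are made up for illustration.

```python
csv_line = 'maria,apple pie,breakfast'    # hypothetical comma-separated line
by_comma = csv_line.split(',')

messy = 'pie   for\tbreakfast\n'
by_default = messy.split()       # any run of whitespace counts as one separator

by_space = 'a  b'.split(' ')     # an explicit ' ' splits at every single space

print(by_comma)    # ['maria', 'apple pie', 'breakfast']
print(by_default)  # ['pie', 'for', 'breakfast']
print(by_space)    # ['a', '', 'b']
```

Note the empty string in the last result: with an explicit separator, two separators in a row produce an empty element.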

How is it useful? 

- Let's imagine this task: 

> For each separate word in a sentence, save the ones that have a length > 5. 

Hint: we can apply len() function to strings the same way as we did with lists

In [36]:
text = "A pie is a baked dish which is usually made of a pastry dough casing that contains a filling of various sweet or savoury ingredients"

- Possible solution for this task:

In [37]:
long_words = []
for word in text.split():
    if len(word) > 5:
        long_words.append(word)
print(long_words)

['usually', 'pastry', 'casing', 'contains', 'filling', 'various', 'savoury', 'ingredients']


Or we can do the same thing in a list comprehension in just one line:

In [38]:
long_words = [word for word in text.split() if len(word) > 5]
print(long_words)

['usually', 'pastry', 'casing', 'contains', 'filling', 'various', 'savoury', 'ingredients']


### 2.9) Combine what we learned

Now let's try to use everything we learned in this lesson for a much more complicated task. 

We'll also use some things that we learned in the previous lessons (like dictionaries)

![combine](markdown/combine.jpg)

- Task Description:

> 1. Create a function preprocess_text() that gets a string as input
> 2. Return a frequency dictionary whose keys are the unique words and whose values are the number of times each word appears
> 3. Don't forget to apply lowercase to the words
> 4. Replace each whole number (e.g. 123, 213) with the word 'number'.

Example:

input: `"I like 123 I print 213"`

output: `{'i' : 2, 'like' : 1, 'number' : 2, 'print' : 1}`

- Possible solution for this task:

We're going to see 2 ways of doing that.

For the first way let's create a few helper functions.

We're going to create a function **helper()** that builds a list of lowercase words from the original text and replaces each number with the word 'number'. 

Then we create a function called **count_words()** to count the number of times each word appears in the list.

And in the end we call both of them from the **preprocess_text()** function.

While going through these functions, don't worry if you don't understand some parts. We have written a recap below, which explains everything in detail.

In [39]:
# input: string. 
# output: list of lowercase words. All the numbers replaced with 'number'
# example: 
# string = 'How was your day 123'
# helper(string) --> ['how', 'was', 'your', 'day', 'number']
def helper(text):
    """
    create a list of lowercase words from the original text
    replace each number with a word 'number'
    """
    text = text.split()
    result = []
    for word in text:
        if word.isdigit(): # True only if the word consists of digits, e.g. '123'
            result.append('number')
        else:
            result.append(word.lower())
    return result

In [20]:
# input: list of words. 
# output: dictionary with all unique words and the number of times each word appears in the list.
# example: 
# words_list = ['day', 'day', 'evening']
# count_words(words_list) --> {'day' : 2, 'evening' : 1}
def count_words(words_list):
    """
    count each word appearance
    """
    vocab = {}
    for word in words_list:
        if word in vocab: # the word was already seen: increase its counter
            vocab[word] += 1
        else: # first time we see the word: start counting at 1
            vocab[word] = 1
    return vocab

In [23]:
# input: string
# output: dictionary with unique lowercase words and numbers, and the number of times they appear in the string
# example: 
# text = 'How was your day day 123 evening'
# preprocess_text(text) --> 
# {'how' : 1, 'was' : 1, 'your' : 1, 'day' : 2, 'number' : 1, 'evening' : 1}
def preprocess_text(text):
    text = helper(text)
    vocab = count_words(text)
    return vocab

In [42]:
preprocess_text('I like 123 I print 213')

{'i': 2, 'like': 1, 'number': 2, 'print': 1}

Let's recap what we did step by step:

### helper(text):

- **split()** the string into a list of words.
- for each word in the list, check with **isdigit()** whether it consists only of digits (so '123' counts as numeric, but '1.5' does not).
- if it's numeric, add 'number' to our final list of words. If it's not numeric, add the lowercase version of the word to the final list of words.

### count_words(words_list):

- create an empty dictionary called 'vocab'
- for each word in the list of words, check whether this word is already in the vocabulary
- if it's not in the vocab, we have never met this word before, so we add it to the vocabulary and set its number of occurrences to 1.
- if the word is already in the vocabulary, we have met it before, so the only thing we need to do is increase its number of occurrences by 1.

### preprocess_text(text):

- Apply helper() to the input string. Receive a list of lowercase words.
- Apply count_words() to the list of lowercase words. Receive the number of occurrences of each word.
- Return it

Now, let's put everything in one function and make it shorter with the help of list comprehensions:

In [43]:
def preprocess_text(text):
    # create a list of lowercase words from the original text
    # (the same thing we did in helper() function)
    words = [word.lower() if not word.isdigit() else 'number' for word in text.split()]
    # count words
    # (the same thing we did in count_words() function)
    vocab = {}
    for word in words:
        if word in vocab:
            vocab[word] += 1
        else:
            vocab[word] = 1 
    return vocab

In [44]:
preprocess_text('I like 123 I print 213')

{'i': 2, 'like': 1, 'number': 2, 'print': 1}
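
As a side note, the counting loop can also be delegated to `Counter` from the standard `collections` module. This is only an alternative sketch and not part of the lesson's solution; the function name `preprocess_text_counter` is made up for this example.

```python
from collections import Counter

def preprocess_text_counter(text):  # hypothetical name, not used elsewhere in the lesson
    # same word list as before: lowercase words, digit-only words replaced with 'number'
    words = ['number' if word.isdigit() else word.lower() for word in text.split()]
    # Counter counts how many times each element appears; dict() turns it into a plain dictionary
    return dict(Counter(words))

print(preprocess_text_counter('I like 123 I print 213'))
# {'i': 2, 'like': 1, 'number': 2, 'print': 1}
```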

## 3) Exceptions and errors handling

Until now we didn't talk about errors and exceptions, but they are an important part of every programming language.
We want to be sure that our program works in all possible scenarios and that it doesn't crash at the most important moment. More than that, we might want to anticipate some types of errors and make our program behave differently in each case.
So let's see!

### 3.1) Exceptions

#### 3.1.1) Exception examples

While doing other exercises you have probably seen several types of errors (exceptions). Many of them are built-in errors, and their messages tell you where to look for the problem. Let's see a few examples:

#### 3.1.1.1) Syntax Error   
As the name suggests, there is a syntax error in your code. It usually means that you made a typo somewhere.

In [45]:
# typo in the code: there is no colon after the if statement.
if True
    print('true')

SyntaxError: invalid syntax (<ipython-input-45-f0df6d1fc87f>, line 2)

#### 3.1.1.2) Zero Division Error   
Means that you're trying to divide by zero (captain obvious). Check every place with a division and see what the values of the variables are there.

In [46]:
# we can't divide by zero
1/0

ZeroDivisionError: division by zero

#### 3.1.1.3) Type Error   
Means that the operation you're trying to perform expects a different data type.

In [47]:
# string + int concatenation is not possible
'1'+ 1

TypeError: can only concatenate str (not "int") to str

#### 3.1.1.4) Name Error   
The Python interpreter doesn't recognize the name `a`, because no variable with that name has been defined.

In [48]:
# call a variable that doesn't exist
print(a)

NameError: name 'a' is not defined

__Important Note:__ Always read the exception name and text so you understand what happened!

#### 3.1.2) Raise Exceptions

The above errors are built-in exceptions raised by Python itself. Python also lets us raise exceptions ourselves.

Sometimes we want to stop the program when a certain condition occurs. We can do that by raising an exception.

We're going to raise a generic `Exception` with our own message (not one of the specific errors you saw above).

Let's print a number only if it's bigger than 5. If the condition is not met, we raise an exception:

In [49]:
def test(num):
    if num > 5:
        print(num)
    else:
        raise Exception('The number is not bigger than 5')

In [50]:
test(6)

6


In [51]:
test(3)

Exception: The number is not bigger than 5

Or we can also raise exceptions outside functions if we want:

In [25]:
print('This line will be printed')
raise Exception('The number is less than 5')
print('This line will not be printed')

This line will be printed


Exception: The number is less than 5

As you can see, the program stops as soon as an exception is raised.

For this reason, the print line after the raised exception is not executed.
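
We can also raise one of the built-in exception types with our own message, which usually tells the reader more than a generic `Exception`. A minimal sketch; the function `check_age` and its rule are made up for this example:

```python
def check_age(age):  # hypothetical function
    if age < 0:
        # ValueError is the built-in exception for a value that is not acceptable
        raise ValueError('age cannot be negative')
    return age

print(check_age(30))  # 30
# check_age(-1) would stop the program with: ValueError: age cannot be negative
```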

### 3.2) Assert

Instead of writing 

`if not condition:
    raise Exception()`

we can use the assert statement:

In [53]:
def test(num):
    assert num > 5
    print(num)

An assertion simply checks whether a boolean expression is true or false.

If it is true, the program continues to run and moves to the next line of code. On the other hand, if it's false, the program stops and raises an AssertionError.

In [54]:
test(6)

6


In [55]:
test(3)

AssertionError: 

We can also add a message so that the assertion error is easier to understand

In [56]:
def test(num):
    assert num > 5, 'The number has to be bigger than 5'
    print(num)

In [57]:
test(6)

6


In [58]:
test(3)

AssertionError: The number has to be bigger than 5

In [59]:
for line in lines:
    assert line != 'pie for breakfast', "Don't eat pies without me!"
    print(line)

maria

ate an apple



AssertionError: Don't eat pies without me!

For example, we use a lot of asserts to check your exercise notebooks!

We could check if the output of your functions is right, if a length of created strings is correct etc.

If your code is in line with what we expected and produces the correct output, the program continues to run and no error is raised. 

However, if your code is not correct, an AssertionError is raised! So if you don't see any error when running the test cells with a bunch of asserts, it means that your solution is accurate. Congrats!
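
A test cell of this kind might look roughly like the sketch below. The function `add_one` and the checks are invented for illustration and are not taken from the real exercise notebooks.

```python
def add_one(number):  # hypothetical function being tested
    return number + 1

# every assert passes silently when its condition is true
assert add_one(1) == 2, 'add_one(1) should return 2'
assert add_one(-1) == 0, 'add_one(-1) should return 0'
assert isinstance(add_one(1.5), float), 'a float input should give a float output'

print('All tests passed!')
```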

###  3.3) Handling Exceptions

Great. But sometimes (and even often) we don't want to stop the program if something wrong happens. 

Why?

Well, because we can often foresee that an error might occur and come up with a way to deal with it.

We use try/except for that. The syntax is the following:

`try:
    do things that might raise an exception
except <exception_type>:
    if this type of exception is raised, do other things`

Exception types are the same things you saw in section 3.1.1.

There is a link to more exception types at the end of this notebook.

If we don't specify the exception type, any exception will be caught.

For example, let's iterate over a list of numbers and add 1 to each of them. If there is a non-numeric element in the list, let's catch the error and say that we can't add 1 to a non-numeric value.

In [60]:
def test(array):
    for element in array:
        try:
            print(element + 1)
        except TypeError:
            print(element, 'is not a number. We cannot add 1 to it')

In [61]:
test([1,2,3, 'a'])

2
3
4
a is not a number. We cannot add 1 to it
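
try/except is also handy for checking whether a string holds a number. `isdigit()` only recognizes strings made of digits, so a decimal such as '1.5' is not detected. One possible approach, sketched here with a made-up helper called `is_number`, is to try the conversion with `float()` and catch the `ValueError` it raises for non-numeric text. Note that `float()` also accepts strings such as 'inf' or '1e3', which may or may not be what you want.

```python
def is_number(word):  # hypothetical helper
    try:
        float(word)        # raises ValueError if the text is not a number
        return True
    except ValueError:
        return False

print('1.5'.isdigit())    # False
print(is_number('1.5'))   # True
print(is_number('123'))   # True
print(is_number('pie'))   # False
```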


Try/except blocks are very commonly used, so remember them!
***

Awesome, you're done with this notebook!

We hope it was useful!

To practice the things you learned, solve the practical exercise. 

Don't get confused: it's going to require a bit more than what you learned in this lecture.

It's really important to learn how to google things, so do it until you find the right answer. And may the force be with you!

# Additional materials:
- [more about reading files in python](https://stackabuse.com/reading-files-with-python/)
- [even more about reading files](https://realpython.com/read-write-files-python/)
- [more about strings in python](https://realpython.com/python-strings/)
- [additional methods to handle strings in python](https://towardsdatascience.com/useful-string-methods-in-python-5047ea4d3f90)
- [python exceptions documentation](https://docs.python.org/3/tutorial/errors.html)
- [python built-in exceptions](https://docs.python.org/3/library/exceptions.html)