# Introduction to Python for Data Science
### Tomasz Rodak
## Lab IV

2024/2025, winter semester

---

## Literature


* [The Python Tutorial](https://docs.python.org/3/tutorial/index.html)
* [Dive Into Python 3](https://diveintopython3.net/index.html)
* [Automate the Boring Stuff with Python](https://automatetheboringstuff.com/)
* [Python 3 documentation](https://docs.python.org/3/index.html)



## Working with files

Python provides a built-in `open()` function to open a file. The most important arguments of the `open()` function are:
* string `filename` - the path to the file to open, that is, the directory path combined with the file name;
* string `mode` - the mode in which the file should be opened.

By default, the `open()` function opens the file in read-only text mode (`'r'`). The function returns a *file object*. Depending on the mode, the file object can be used to read from, write to, or append to the file. When you are done with the file, you should close it using the `close()` method of the file object.

Assume we want to create a file `example.txt` with the following content:

```
Lorem ipsum dolor sit amet, consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris
nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in
reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla
pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
culpa qui officia deserunt mollit anim id est laborum.
```

The following code creates the file and writes the content to it:

```python
>>> f = open('example.txt', 'w', encoding='utf-8')
>>> f.write('Lorem ipsum dolor sit amet, consectetur adipiscing elit,\n')
>>> f.write('sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.\n')
>>> f.write('Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris\n')
>>> f.write('nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in\n')
>>> f.write('reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla\n')
>>> f.write('pariatur. Excepteur sint occaecat cupidatat non proident, sunt in\n')
>>> f.write('culpa qui officia deserunt mollit anim id est laborum.\n')
>>> f.close()
```


The following code then reads the contents of the file and prints them on the screen:

```python
>>> f = open('example.txt', encoding='utf-8')
>>> text = f.read()
>>> f.close()
>>> print(text)
Lorem ipsum dolor sit amet, consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris
nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in
reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla
pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
culpa qui officia deserunt mollit anim id est laborum.
```

In the example above:
* `f` is the file object,
* `f.write()` writes the string to the file (because the mode is `'w'`, the file is opened for writing in text mode),
* `f.read()` reads the contents of the entire file into a string (because the mode is `'r'`),
* `f.close()` closes the file.
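
Forgetting `f.close()` is an easy mistake. A common alternative, shown here only as a sketch, is the `with` statement, which closes the file automatically when the indented block ends. The file name `with_demo.txt` and the temporary directory are assumptions made for this illustration; in the lab you would simply use a name such as `example.txt`.

```python
import os
import tempfile

# A temporary directory keeps this demo away from your own files.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'with_demo.txt')  # hypothetical file name

    # The file is closed automatically when the indented block ends.
    with open(path, 'w', encoding='utf-8') as f:
        f.write('first line\n')
        f.write('second line\n')

    with open(path, encoding='utf-8') as f:
        text = f.read()

    # The file object still exists here, but it is already closed.
    closed_after_block = f.closed

print(text, end='')
print(closed_after_block)
```

The `closed` attribute of the file object confirms that no explicit `f.close()` call was needed.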

### Exercise 4.1

Run the code above in the Python interpreter and check that it works as described.

---

### Working directory

You may have noticed that the code above does not specify the full path to the file. The reason is that a file name given without a directory path is resolved relative to the *working directory*, so the file is created there. The Thonny IDE, like many other IDEs, has a concept of a working directory: it is the directory from which the program is run. When you open a file without specifying the full path, Python looks for the file in the working directory. You can check the working directory by running the following code:

```python
>>> import os
>>> print(os.getcwd())
```

You can change the working directory using the `os.chdir()` function:

```python
>>> os.chdir('C:/Users/John/Documents')
```

### Text Encoding

Text encoding is a mapping between characters and bytes. Different encodings map characters to byte sequences in different ways. The most common encodings are:

* **ASCII**: Uses 7 bits to represent characters. It can encode 128 characters, including English letters, digits, and some control characters.
* **UTF-8**: A variable-width encoding that can represent every character in the Unicode character set. It uses one to four bytes for each character.
* **UTF-16**: Another variable-width encoding that uses two or four bytes for each character.
* **ISO-8859-1**: Also known as Latin-1, it uses 8 bits to represent characters and can encode 256 characters.

When working with text files, it is important to specify the correct encoding to ensure that the file is read and written correctly. In Python, you can specify the encoding when opening a file using the `encoding` parameter of the `open()` function.
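
The effect of the encoding can be observed without any file, using the `str.encode()` and `bytes.decode()` methods. The following is a minimal sketch; the sample word is an arbitrary choice for this illustration.

```python
word = 'żółw'  # Polish for "turtle": 4 characters, 3 of them non-ASCII

utf8_bytes = word.encode('utf-8')
utf16_bytes = word.encode('utf-16-le')  # little-endian UTF-16, no byte order mark

print(len(word))         # 4 characters
print(len(utf8_bytes))   # 7 bytes: 'ż', 'ó', 'ł' take 2 bytes each, 'w' takes 1
print(len(utf16_bytes))  # 8 bytes: 2 bytes per character here

# Decoding with the wrong encoding raises no error in this case;
# it silently produces the wrong characters.
garbled = utf8_bytes.decode('iso-8859-1')
print(garbled == word)   # False

# ASCII cannot represent 'ż' at all, so encoding fails.
try:
    word.encode('ascii')
except UnicodeEncodeError:
    ascii_failed = True
else:
    ascii_failed = False
print(ascii_failed)      # True
```

This is why a file written in one encoding and read in another may show unexpected characters or cause an error.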

### Exercise 4.2

Write a program `read_file.py`. The program should ask the user for the name and encoding of the file to read. Then the program should read the contents of the file and print them on the screen.

In a text editor create a sample text file with arbitrary content. Run the program and check if it reads the contents of the file correctly.

---

### Exercise 4.3

Write a program `typewriter.py`. The program should ask the user for the name and encoding of the file to write. Then the program takes the text from the user line by line and writes it to the appropriate file. The program should stop when the user enters an empty line. When the program finishes, it should print the number of characters written to the file.

Run the program and check if it writes the contents of the file correctly.

Example:

```
>>> %Run typewriter.py
Enter the name of the file: example.txt
Enter the encoding of the file: utf-8
Enter the text to write to the file (empty line to finish):
Lorem ipsum dolor sit amet, consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris
nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in
reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla
pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
culpa qui officia deserunt mollit anim id est laborum.

Written 447 characters to example.txt
```

---

### Working with text files

The `open()` function, when used with a text mode, returns a file object of the `TextIOWrapper` class. This is a stream object that can be used to read from, write to, or append to the file depending on the mode in which the file was opened. 


The most common methods for reading from a file are `read()`, `readline()`, and `readlines()`. In text mode they behave as follows:
* `read()` - reads the entire file into a string (optionally, you can specify the number of characters to read),
* `readline()` - reads a single line from the file into a string,
* `readlines()` - reads all lines from the file into a list of strings.
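
The three methods share one reading position: each call continues where the previous one stopped. The sketch below illustrates this. To stay self-contained it uses `io.StringIO`, an in-memory text stream that stands in for a file opened in text mode; this substitution is an assumption of the example, not part of the lab.

```python
import io

content = 'first line\nsecond line\nthird line\n'

# io.StringIO offers the same reading methods as a text file object.
f = io.StringIO(content)
head = f.read(5)             # the first 5 characters
rest_of_line = f.readline()  # the rest of the current line, newline included
remaining = f.readlines()    # all remaining lines as a list of strings
at_end = f.read()            # nothing is left, so an empty string
f.close()

print(repr(head))          # 'first'
print(repr(rest_of_line))  # ' line\n'
print(remaining)           # ['second line\n', 'third line\n']
print(repr(at_end))        # ''
```

Note that the strings returned by `readline()` and `readlines()` keep their trailing newline characters.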

For reading line by line, the `for` loop can be used:

```python
>>> f = open('example.txt', encoding='utf-8')
>>> for line in f:
...     print(line, end='')
Lorem ipsum dolor sit amet, consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris
nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in
reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla
pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
culpa qui officia deserunt mollit anim id est laborum.
>>> f.close()
```

For the next set of exercises, download the text file [Three Musketeers](https://www.gutenberg.org/ebooks/1257) by Alexandre Dumas from the Project Gutenberg website. Save the file as `three_musketeers.txt`.

### Exercise 4.4

Answer the following questions about the `three_musketeers.txt` file:

* A line is a sequence of characters terminated by a newline character. How many lines are there in the file?
* A word is a sequence of non-whitespace characters delimited by whitespace. How many words are there in the file?
* How many characters are in the file?
* What is the longest line in the file (in terms of the number of characters)?
* What is the longest word in the file (in terms of the number of characters)?

---

### Exercise 4.5

Write a program `text_file_stats.py`. The program should ask the user for the name of a text file and then print the following statistics about the file:
* the number of lines (a line is a sequence of characters terminated by a newline character),
* the number of words (a word is a sequence of non-whitespace characters delimited by whitespace),
* the number of English letters (treat upper- and lowercase letters as the same; *Hint*: use the `str.lower()` method),
* the number of vowels (again, treat upper- and lowercase letters as the same).

Check the program on some simple text files and on the `three_musketeers.txt` file.

Example output:

```
>>> %Run text_file_stats.py
Enter the name of the file: example.txt
Number of lines: 7
Number of words: 70
Number of letters: 366
Number of vowels: 108
```

---


### Dictionary

For the next exercise it will be very convenient to use a dictionary. A dictionary is a collection of key-value pairs, in which each key is associated with a value. Dictionaries are mutable (you can change their content) and indexed by keys, not by position. Since Python 3.7 they preserve the order in which the keys were inserted. The keys must be unique and hashable, which in practice means immutable objects such as strings, numbers, or tuples of immutable items. The values can be of any type.

You can create a dictionary by placing a comma-separated list of key-value pairs within curly braces `{}`. The syntax for a key-value pair is `key: value`. For example:

```python
>>> d = {'name': 'John', 'age': 30, 'city': 'New York'}
>>> d
{'name': 'John', 'age': 30, 'city': 'New York'}
```

You can access the value associated with a key by using the key in square brackets:

```python
>>> d['name']
'John'
```
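
Square brackets only work for keys that are present; a missing key raises `KeyError`. The sketch below shows two ways to guard against that, the `in` operator and the `dict.get()` method. The key `'email'` and the default `'unknown'` are made up for this illustration.

```python
d = {'name': 'John', 'age': 30, 'city': 'New York'}

print('age' in d)    # True
print('email' in d)  # False

# get() returns the given default instead of raising an error.
email = d.get('email', 'unknown')
print(email)         # unknown

# Square brackets with a missing key raise KeyError.
try:
    d['email']
except KeyError:
    missing = True
else:
    missing = False
print(missing)       # True
```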

You can add a new key-value pair to a dictionary by assigning a value to a new key:

```python
>>> d['married'] = True
>>> d
{'name': 'John', 'age': 30, 'city': 'New York', 'married': True}
```

Here is an example of how to store the frequency of each letter in a string in a dictionary:

```python
>>> text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.'
>>> freq = {}
>>> for char in text:
...     if char in freq:
...         freq[char] += 1
...     else:
...         freq[char] = 1
>>> freq
{'L': 1, 'o': 4, 'r': 3, 'e': 5, 'm': 3, ' ': 7, 'i': 6, 'p': 2, 's': 4, 'u': 2, 'd': 2, 'l': 2, 't': 5, 'a': 2, ',': 1, 'c': 3, 'n': 2, 'g': 1, '.': 1}
```
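
The `if`/`else` pattern above can be shortened with `dict.get()`, which returns a default value for a key that is not in the dictionary yet. The following sketch is one possible variant, not the only way to do it. For comparison, it checks the result against `collections.Counter` from the standard library, which is not used elsewhere in this lab.

```python
from collections import Counter

text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.'

freq = {}
for char in text:
    # get() returns 0 for a character that has not been seen yet
    freq[char] = freq.get(char, 0) + 1

print(freq['e'])                    # 5
print(freq == dict(Counter(text)))  # True
```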

Accessing the keys, values, and key-value pairs of a dictionary can be done using the `keys()`, `values()`, and `items()` methods, respectively:

```python
>>> freq.keys()
dict_keys(['L', 'o', 'r', 'e', 'm', ' ', 'i', 'p', 's', 'u', 'd', 'l', 't', 'a', ',', 'c', 'n', 'g', '.'])
>>> freq.values()
dict_values([1, 4, 3, 5, 3, 7, 6, 2, 4, 2, 2, 2, 5, 2, 1, 3, 2, 1, 1])
>>> freq.items()
dict_items([('L', 1), ('o', 4), ('r', 3), ('e', 5), ('m', 3), (' ', 7), ('i', 6), ('p', 2), ('s', 4), ('u', 2), ('d', 2), ('l', 2), ('t', 5), ('a', 2), (',', 1), ('c', 3), ('n', 2), ('g', 1), ('.', 1)])
```

Assume we want to see the characters sorted by frequency. To this end, we can swap the keys and values, store the resulting pairs in a list, and then sort the list:

```python
>>> freq_list = [(value, key) for key, value in freq.items()]
>>> freq_list.sort(reverse=True)
>>> freq_list
[(7, ' '), (6, 'i'), (5, 't'), (5, 'e'), (4, 's'), (4, 'o'), (3, 'r'), (3, 'm'), (3, 'c'), (2, 'u'), (2, 'p'), (2, 'n'), (2, 'l'), (2, 'd'), (2, 'a'), (1, 'g'), (1, 'L'), (1, '.'), (1, ',')]
```

Here

* `[(value, key) for key, value in freq.items()]` is a list comprehension that creates a list of tuples `(value, key)` from the key-value pairs in the dictionary `freq`. Since `dict.items()` returns the key-value pairs as tuples `(key, value)`, we may iterate over the pairs as `for key, value in freq.items()`.
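* `freq_list.sort(reverse=True)` sorts the list in descending order (default is ascending order). Tuples are compared element by element, so the pairs are ordered by count first and by character second.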
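
Swapping keys and values is not the only option. As a hedged alternative, the built-in `sorted()` function accepts a `key` argument that tells it which part of each pair to compare. The small dictionary below is made up for this illustration.

```python
freq = {'x': 1, 'y': 3, 'z': 2}  # hypothetical counts

# Sort the (key, value) pairs by the value, largest first.
by_count = sorted(freq.items(), key=lambda pair: pair[1], reverse=True)
print(by_count)  # [('y', 3), ('z', 2), ('x', 1)]
```

With this approach the pairs keep their `(key, value)` form. Pairs with equal counts stay in their original order, whereas the swapped-tuple method orders them by character.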

### Exercise 4.6

Write a program `frequency_table.py`. The program asks the user for the name of a text file and its encoding. Then the program creates a frequency table of English letters (upper- and lowercase treated as the same) in the file. The table should take the form of a Markdown table with two columns, the letter and its frequency, and should be sorted in descending order of frequency.

For example, for the text `Abracadabra`, the table should look as follows:

```
| Character | Frequency |
|-----------|-----------|
| a         | 5         |
| b         | 2         |
| r         | 2         |
| c         | 1         |
| d         | 1         |
```

Check the program on some simple text files and on the `three_musketeers.txt` file.

You may paste the table into an online Markdown previewer like [this one](https://markdownlivepreview.com/).

---