<a href="https://colab.research.google.com/github/w4bo/AA2425-unibo-mldm/blob/master/slides/lab-00-introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Opening a file and reading its content**
In this block, we learn how to open a file with Python and how to read the content.

**It is absolutely recommended to read the documentation relating to the functions and methods used!**
Usually, it is sufficient to search on Google for the name of the function (and, if needed, the name of the library used).



### Locate the file we want to use.
The file is `shapes.txt`: this is the file of the dataset that we are going to use in our exercises.

There are three options to use the file in a Colab project:
- Upload the file through the Colab GUI (temporary upload!);
- Upload the file to your Google Drive (you must use the same Google account that you use with Colab). Then, mount the drive in your Colab machine (use the "Mount Drive" button);
- Use the following code snippet.
<br>


In [None]:
from google.colab import files
uploaded = files.upload()

By default, the file is uploaded to `/content` in the Google virtual machine.

In [None]:
file_path = '/content/shapes.txt' # equivalent to: 'shapes.txt'

### First way to open a file
Remember to close the file once read!
<br>
**Tools**:

* `open(<path>, <mode>)`: opens the file located at `path` for reading, writing, etc., depending on the `mode`.

In [None]:
f = open(file_path, 'r')
lines = f.readlines()
f.close()
print('Read {} lines'.format(len(lines)))

### Second way to open a file
In this case, the file is automatically closed when the `with` block ends.

In [None]:
with open(file_path, 'r') as f:
    lines = f.readlines()
print('Read {} lines'.format(len(lines)))
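
To see the automatic closing in action, the sketch below inspects the `closed` attribute of the file object inside and after the `with` block. It writes a small throwaway file first so that it runs anywhere; `demo_path` and its content are illustrative assumptions and are not part of the dataset.

```python
import os
import tempfile

# Hypothetical throwaway file, created only so that the example is self-contained
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(demo_path, 'w') as f:
    f.write('first line\nsecond line\n')

with open(demo_path, 'r') as f:
    demo_lines = f.readlines()
    closed_inside = f.closed  # still open inside the with block

closed_after = f.closed       # closed as soon as the with block ends

print('Read {} lines'.format(len(demo_lines)))
print('Closed inside the block:', closed_inside)
print('Closed after the block:', closed_after)
```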

# **File content processing**
In this block, we learn how to process the content of a file.

### Read the first two lines
Note that each line has the following format: `image_name` `coordinates`.
Coordinates are organized as follows: $p^1_x \quad p^1_y \quad p^2_x \quad p^2_y \quad \dots \quad p^n_x \quad p^n_y$, with $3 \leq n \leq 4$, since the shapes belong to the following classes:

*   Triangle
*   Rectangle
*   Square
*   Rhombus

Example: `0_triangle.png 114 221 152 189 223 30`  

**Tools**:
- `strip()`: removes leading and trailing characters (whitespace, including the newline, by default).
- `split()`: breaks up a string at the specified separator and returns a list of strings.
- Slicing: `list[start:stop:step]`

Examples:

*   `a[start:stop]  # items start through stop-1`
*   `a[start:]      # items start through the rest of the array`
*   `a[:stop]       # items from the beginning through stop-1`
*   `a[:]           # a copy of the whole array`
*   `a[-1]    # last item in the array`
*   `a[-2:]   # last two items in the array`
*   `a[:-2]   # everything except the last two items`
*    `a[::-1]    # all items in the array, reversed`
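
The slicing rules listed above can be tried on a small list. The sketch below uses the coordinates of the example line as plain integers; the list name `a` simply follows the examples above.

```python
# Coordinates taken from the example line, as integers
a = [114, 221, 152, 189, 223, 30]

print(a[1:3])    # items 1 through 2
print(a[2:])     # items from index 2 to the end
print(a[:2])     # the first two items
print(a[-1])     # the last item
print(a[-2:])    # the last two items
print(a[:-2])    # everything except the last two items
print(a[::-1])   # all items, reversed
print(a[::2])    # every second item, starting from index 0
```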

In [None]:
for line in lines[:2]:

    print("Raw content:", line)

    content = line.strip()
    print("Content after strip:", content)

    content = content.split(' ')
    print("Content after split:", content)

    # now content is a list containing the split elements
    image_name = content[0]   # the image name is in the first position
    coordinates = content[1:] # coordinates in the following part

    # print the content
    print('Name:', image_name, 'Coords:', coordinates)
    print('---')

Note that the elements in the list are strings. We must convert these strings to integers (`int`). These integers are then saved in a list.

**Tools**:
- List comprehension: `[expression for item in list]`

In [None]:
print('Type of the coordinates:', type(coordinates[0]))
coordinates = [int(x) for x in coordinates]
print('Name:', image_name, 'Coords:', coordinates)
print('Type of the coordinates:', type(coordinates[0]))
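
Once the coordinates are integers, it can be convenient to group the flat list into `(x, y)` points. This is only one possible sketch, assuming the `x y x y ...` ordering described above; the names `coords`, `xs`, `ys` and `points` are illustrative.

```python
# Flat coordinate list from the example line: x1 y1 x2 y2 x3 y3
coords = [int(x) for x in '114 221 152 189 223 30'.split(' ')]

xs = coords[0::2]  # items at even positions: the x values
ys = coords[1::2]  # items at odd positions: the y values
points = list(zip(xs, ys))

print('Points:', points)
print('Number of vertices:', len(points))
```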

### Exercise
Read the content of the file `shapes.txt` and **accumulate** the image names and coordinates (integers) in two different lists, named `image_names` and `shape_coordinates`. Output the final length of the two lists.

In [None]:
# write the code here

# **Understanding the directory organization of the dataset**

In this block, we explore the directory of the dataset used in our exercises.

As in the previous case, there are several ways to upload the content. The first one takes two steps:

1.   Upload the `.zip` file containing the *Euclid* dataset from the left panel (first icon on the top bar)

2.   Unzip the file using the following command. The dataset folders will appear in `/content/dataset`

In [None]:
!unzip -q Euclid_dataset.zip -d /content/dataset

A new folder now appears in your virtual machine (click the "Reload" icon if you do not see it).

Alternatively, you can upload the dataset folder directly to your Google Drive account (the Drive must be mounted first).

Let's define the path of the dataset

In [None]:
dataset_path = '/content/dataset/Euclid_dataset' # /content/drive/... if the dataset is in the Drive folder

### Count the number of images in each dataset folder

**Tools**:

- `glob(<pathname>)`: returns a possibly empty list of path names that match `pathname`. Besides exact strings, the pattern can contain wildcards (`*`, `?`, `[ranges]`), which makes path retrieval simpler and more convenient.
- `os.path.join`: join one or more path components
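
The three wildcards can be tried on a throwaway directory before touching the dataset. The sketch below is only an illustration: the temporary directory and the file names in it are made up for the example.

```python
import os
import tempfile
from glob import glob
from os.path import join

# Hypothetical directory with a few empty files, created only for this example
tmp_dir = tempfile.mkdtemp()
for name in ['0_triangle.png', '1_square.png', '12_square.png', 'notes.txt']:
    open(join(tmp_dir, name), 'w').close()

def names(pattern):
    # Return the sorted file names (without the directory) matching the pattern
    return sorted(os.path.basename(p) for p in glob(join(tmp_dir, pattern)))

print(names('*.png'))        # '*' matches any sequence of characters
print(names('?_*.png'))      # '?' matches exactly one character
print(names('[1-9]_*.png'))  # '[1-9]' matches one character in the range
```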



Example of the `join` command.

In [None]:
from os.path import join
s1 = 'path1'
s2 = 'path2'
s3 = 'path3'
print('Final path:', join(s1, s2, s3))

Example of the `glob` command

In [None]:
from glob import glob
# we list the directories inside the dataset main directory
elements = glob(join(dataset_path, '*'))
print(len(elements), elements)


Now, we are ready to count the number of images in each folder of the dataset.

In [None]:
shapes = ['triangle', 'rectangle', 'square', 'rhombus']
for shape in shapes:
    images = glob(join(dataset_path, shape, '*.png'))
    print('Total amount of {}: \t {}'.format(shape, len(images)))

Let's compute the total number of images.

In [None]:
images = glob(join(dataset_path, '*', '*.png'))
print('Total amount of images: {}'.format(len(images)))
print('List of images:', sorted(images))

### Exercise
For each element of the Euclid dataset, compute:

*   **Perimeter**
*   **Area**
*   **Centroid** (barycenter)

Save the results in a `results.txt` file. Each line of the file must correspond to the same line of the `shapes.txt` file and must follow this format:

`<perimeter> <area> <(x,y)>`

Example:
423 433.5 (234,456)
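
As a hint for the output step only, the sketch below writes lines in the required format using `open` in write mode. It does not compute anything: the first tuple repeats the example above and the second is a made-up placeholder. The temporary output path is an assumption so that the snippet runs anywhere; in Colab you could simply use `'results.txt'`.

```python
import os
import tempfile

# Placeholder values, NOT computed from the dataset
results = [
    (423, 433.5, (234, 456)),  # values from the example line
    (100, 625.0, (50, 50)),    # hypothetical values
]

# Hypothetical output location; replace with 'results.txt' in your notebook
out_path = os.path.join(tempfile.mkdtemp(), 'results.txt')

with open(out_path, 'w') as f:
    for perimeter, area, (cx, cy) in results:
        f.write('{} {} ({},{})\n'.format(perimeter, area, cx, cy))

# Read the file back to check what was written
with open(out_path, 'r') as f:
    written = f.read().splitlines()
print(written)
```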

In [None]:
# write the code here