# OS & PIL

In this notebook we'll briefly introduce two useful libraries for working with image data sets, namely `os` for working with multiple files, and `PIL` for processing images. For this we'll use a subset of the plankton data from the Kaggle Data Science Bowl, where scientists provided over 30,000 grayscale images of plankton and identified them as one of 121 species.

In [None]:
import os

from PIL import Image, ImageDraw
from notebook_checker import start_checks

# Start automatic globals checks
%start_checks

# An introduction to the `os` module

In data science and machine learning, managing data files and directories is a fundamental skill. For example, looping through all the files in a data set is a very common operation, and the `os` module provides the basic tools to do this.

A common issue is that code must be usable on different operating systems, where the underlying file system might have a completely different structure. For example, on Windows a file path might look like `C:\Users\name\Documents\test.txt`, whereas on Mac or Linux it will look something like `/home/name/Documents/test.txt`. Note the different overall structure and the different slashes used to separate folders.

Whether you're working on Windows, Mac, or Linux, `os` provides the same unified interface to the underlying file system. You can use `os` to perform tasks like navigating the file system, reading file properties, and managing directories. For now, the basic functions we will use from the `os` module are:

[`os.listdir(path)`](https://docs.python.org/3/library/os.html#os.listdir) - Returns a list containing the names of all the files and folders in the directory given by `path`. This is the Python equivalent of typing the command `ls` in your terminal.

[`os.path.isdir(path)`](https://docs.python.org/3/library/os.path.html#os.path.isdir) - Returns `True` if the `path` is an existing directory.

[`os.path.isfile(path)`](https://docs.python.org/3/library/os.path.html#os.path.isfile) - Returns `True` if the `path` is an existing regular file.

[`os.path.join(path, *paths)`](https://docs.python.org/3/library/os.path.html#os.path.join) - Joins one or more `path` components intelligently. You can give this function any number of `path` strings, and it will join them together into one complete new path.

This last function is extremely useful when navigating through multiple folders, as it adapts its result to the operating system your machine is running (which is what `os` stands for). For example, `os.path.join('plankton', 'amphipods')` will give you the string `'plankton/amphipods'` on Mac or Linux, but `'plankton\amphipods'` on a Windows machine.
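
As a small sketch of this behaviour, the cell below joins a few components and prints the separator that was used. The file name `example.jpg` is hypothetical and only serves as an illustration; nothing here reads from disk.

```python
import os

# Join two path components using the separator of the current operating system.
joined = os.path.join('plankton', 'amphipods')
print(joined)

# os.sep holds the separator that os.path.join uses: '/' on Mac/Linux, '\\' on Windows.
print(repr(os.sep))

# Any number of components can be joined in one call.
# 'example.jpg' is a hypothetical file name, used only for illustration.
nested = os.path.join('plankton', 'amphipods', 'example.jpg')
print(nested)
```

Because the separator is chosen for you, the same line of code produces a valid path on every platform.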

We will use these functions to examine and check the structure of our dataset.
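
To get a feel for these functions before touching the real data, here is a minimal sketch that runs on a throwaway directory created with `tempfile` (an assumption made so the example does not depend on the `plankton` folder). The names `some_folder` and `some_file.txt` are made up for illustration.

```python
import os
import tempfile

# Build a small throwaway directory so the example does not depend on the plankton data.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, 'some_folder'))
    with open(os.path.join(root, 'some_file.txt'), 'w') as f:
        f.write('hello')

    # os.listdir returns bare names, not full paths (sorted here for a stable order).
    names = sorted(os.listdir(root))
    print(names)

    # A bare name must be joined with its parent before it can be tested.
    kinds = {}
    for name in names:
        full_path = os.path.join(root, name)
        kinds[name] = 'folder' if os.path.isdir(full_path) else 'file'
        print(f'{name}: isdir={os.path.isdir(full_path)}, isfile={os.path.isfile(full_path)}')
```

Note that `os.path.isdir(name)` on the bare name would be checked relative to the current working directory, which is why the join step matters.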

## Preliminary analysis

Before you do any coding on a large collection of files, it is always a good idea to inspect the files and folders manually. That way, you can better interpret the results you get back from your code. Open up the folder `plankton` on your computer, and take a look at some of these images and the folder's structure.

**Q1. Describe the structure of the `plankton` folder and list anything that stands out.**

*Your answer goes here.*

## Listing folders

Write a function `list_folders(directory)` that accepts a `directory` and returns a list of all folders within that `directory`. Use `os.listdir` to get the contents of the `directory`. Then, loop over the results and use `os.path.join` followed by `os.path.isdir` to check that each of the elements returned by `os.listdir` is actually a folder. Finally, return a list of all elements that are folders.

You should end up with 121 different folders; one for each of the 121 different classes within the dataset.

In [None]:
def list_folders(directory):
    # YOUR CODE HERE

main_directory = 'plankton'
classes = list_folders(main_directory)
N = 5

print(f'In total, the directory {main_directory} contains {len(classes)} classes. The first {N} of these classes are: \n{classes[:N]}')

## Counting files in a folder

In your quick analysis of the contents of the `plankton` folder you might have noticed that the number of images for each class isn't always equal. Write the function `count_files(directory)`. This function should accept a `directory` and count the number of files that are inside this directory. This function should be very similar to your previous function.

Check your code by manually comparing the resulting number of images with the number of images in each of the example folders.

In [None]:
def count_files(directory):
    # YOUR CODE HERE

    
for species in classes[:N]:
    species_directory = os.path.join(main_directory, species)
    species_image_count = count_files(species_directory)
    
    print(f'The directory {species_directory} contains {species_image_count} images.')

## `os.walk`

The most powerful of all the `os` functions is `os.walk`, as it can traverse all the folders inside some folder, and then the folders inside those folders, and so on. In this case our data set is simple enough that we probably would not need to use `os.walk` to navigate it, but it is a lifesaver for complex data sets with many nested folders.

The function [`os.walk`](https://docs.python.org/3/library/os.html#os.walk) will step through every folder inside a path. At each step, the function yields a tuple of 3 things:

* `dirpath` The path to the folder you are currently in at this step. This tells you how to get from the top folder to the current folder for this step.
* `dirnames` A list of all the folders that are inside this folder.
* `filenames` A list of all the files that are inside this folder.

Take a look at the example below, which combines a lot of the features from the two functions above, in just a few lines:

In [None]:
count = 0

for (dirpath, dirnames, filenames) in os.walk(main_directory):
    print(f'Visiting: {dirpath}, which has {len(dirnames)} folders and {len(filenames)} files.')
    count += 1

print(f'\nUsing os.walk, {count} directories were visited.')
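
To see exactly what `os.walk` yields at each step, the sketch below walks a tiny temporary tree instead of the plankton data. The layout (`a`, `a/deeper`, `b`, and the two text files) is hypothetical and is created on the fly with `tempfile`.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: root/a/x.txt, root/a/deeper/y.txt, root/b/ (empty)
    os.makedirs(os.path.join(root, 'a', 'deeper'))
    os.mkdir(os.path.join(root, 'b'))
    for parts in [('a', 'x.txt'), ('a', 'deeper', 'y.txt')]:
        with open(os.path.join(root, *parts), 'w') as f:
            f.write('placeholder')

    visited = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Store paths relative to root so the output is easier to read.
        relative = os.path.relpath(dirpath, root)
        visited[relative] = (sorted(dirnames), sorted(filenames))
        print(relative, sorted(dirnames), sorted(filenames))
```

Every folder is visited exactly once, including the top folder itself (shown as `.`) and the empty folder `b`. The `filenames` list only holds bare names, so `dirpath` is needed to build a full path to any file.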

## Finding all image paths using `os.walk`

Using `os.walk`, write a function `find_images(directory)` that finds all images within a `directory`. The function should return a list of the paths to each of the images in all of the folders. Each image path in this list should be the complete path from the `directory`, and so should also include the class folder. The results should look something like: `'plankton/<class_name>/<number>.jpg'` on Mac or Linux, and `'plankton\<class_name>\<number>.jpg'` on Windows.

**Hint:** You can check if some file is an image by checking the extension of the file! In our case, all images have the `'.jpg'` extension.
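
One possible way to inspect an extension is sketched below, using `str.endswith` and `os.path.splitext` on a few hypothetical file names (none of these names come from the data set).

```python
import os

# Hypothetical file names, used only for illustration.
names = ['1234.jpg', 'notes.txt', 'README', 'scan.JPG']

for name in names:
    # os.path.splitext splits a name into (everything before the extension, the extension).
    stem, extension = os.path.splitext(name)
    print(f'{name!r}: extension={extension!r}, endswith .jpg: {name.endswith(".jpg")}')

# Lower-casing the extension also matches variants such as '.JPG'.
jpg_names = [name for name in names if os.path.splitext(name)[1].lower() == '.jpg']
print(jpg_names)
```

For this data set a plain `endswith('.jpg')` check should be enough, since all images share the same lower-case extension.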

In [None]:
def find_images(directory):
    # YOUR CODE HERE

image_paths = find_images(main_directory)
M = 5

print(f'The first {M} image paths are:')
for image_path in image_paths[:M]:
    print(f'- {image_path}')

print(f'\nThe total number of images is: {len(image_paths)}')

# An introduction to the `PIL` module

We will display images using the Python Imaging Library (`PIL`). This library adds support for opening, manipulating, and saving most image file formats. Using this library simplifies loading and displaying the images from our dataset, and enables us to modify them if needed.

In the cell below, you will find an example of how `PIL` can be used to display a single image. 

In [None]:
image_index = 3000
image_path = image_paths[image_index]

print(image_path)

plankton_image = Image.open(image_path)
display(plankton_image)

We can then use `PIL` to perform some common image operations, like

- [`Image.rotate(angle)`](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.rotate): Rotate the image counter-clockwise by `angle` degrees and return the new image.
- [`Image.resize(size)`](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.resize): Resize the image to the tuple `size`, given as `(width, height)` in pixels, and return the new image.
- [`Image.crop(box)`](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.crop): Crop (cut out) the part of the image defined by the rectangle `box`, given as `(left, upper, right, lower)`, and return the new image.
- [`Image.copy()`](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.copy): Create a copy of the image and return the new image.

For example, the code below resizes the selected image to 400 by 400 pixels, and then rotates the result 270 degrees counter-clockwise.

In [None]:
scaled_image = plankton_image.resize((400, 400))
rotated_image = scaled_image.rotate(270)

display(rotated_image)
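
The remaining operations can be tried out in the same way. The sketch below uses a blank image created with `Image.new` (an assumption, so that it runs without the data set) and only prints sizes, to show that every operation returns a new image and leaves the original untouched.

```python
from PIL import Image

# A blank grayscale ('L') image, 100 pixels wide and 80 pixels high, filled with mid-gray.
blank = Image.new('L', (100, 80), 128)

# size is always (width, height).
resized = blank.resize((50, 40))

# The crop box is (left, upper, right, lower), so this cuts out a 50 x 40 region.
cropped = blank.crop((10, 10, 60, 50))

# Each operation returned a new image; the original is unchanged.
print(blank.size, resized.size, cropped.size)

# rotate() keeps the original canvas size unless expand=True is passed.
kept = blank.rotate(90)
expanded = blank.rotate(90, expand=True)
print(kept.size, expanded.size)
```

For a non-square image, rotating without `expand=True` cuts off whatever no longer fits on the original canvas, which is worth keeping in mind before rotating images that have not been resized to a square first.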

By using [`ImageDraw`](https://pillow.readthedocs.io/en/stable/reference/ImageDraw.html), we can also draw basic shapes on the image using `PIL`. Most of the shapes for `ImageDraw` use a *bounding box* to place the shape. This should be a rectangle (*box*) that fits around the edges (*bounds*) of the shape. Bounding boxes are defined as a list of coordinates like:

    [x0, y0, x1, y1]  or  [(x0, y0), (x1, y1)]

where the coordinate $(0, 0)$ is always the top left corner of the image, and $x_0 \le x_1$ and $y_0 \le y_1$. This means `(x0, y0)` should be the top left coordinate of the bounding box, and `(x1, y1)` should be the bottom right coordinate. Some of the most common `ImageDraw` functions that use bounding boxes (referred to as `xy`) are:

- [`ImageDraw.rectangle(xy)`](https://pillow.readthedocs.io/en/stable/reference/ImageDraw.html#PIL.ImageDraw.ImageDraw.rectangle): Draw the rectangle defined by the bounding box `xy`.
- [`ImageDraw.ellipse(xy)`](https://pillow.readthedocs.io/en/stable/reference/ImageDraw.html#PIL.ImageDraw.ImageDraw.ellipse): Draw the ellipse inside the bounding box `xy`. Note that if the bounding box is a square, the result will be a circle.
- [`ImageDraw.line(xy)`](https://pillow.readthedocs.io/en/stable/reference/ImageDraw.html#PIL.ImageDraw.ImageDraw.line): Draw a line from the first coordinate to the second coordinate in the bounding box `xy`.
- [`ImageDraw.arc(xy, start, end)`](https://pillow.readthedocs.io/en/stable/reference/ImageDraw.html#PIL.ImageDraw.ImageDraw.arc): Draw an arc by only drawing part of an ellipse, which is defined by the bounding box `xy`. Starting from the 3 o'clock position moving clockwise, only draw the ellipse from `start` angle (in degrees) until `end` angle.
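
As a rough illustration of the bounding-box convention, the sketch below draws three of these shapes on a blank white canvas created with `Image.new`. The canvas size and all coordinates are arbitrary choices made for this example.

```python
from PIL import Image, ImageDraw

# A blank white grayscale canvas, so no data set is needed (0 = black, 255 = white).
canvas = Image.new('L', (200, 200), 255)
draw = ImageDraw.Draw(canvas)

# Rectangle with top-left corner (20, 20) and bottom-right corner (180, 100).
draw.rectangle([20, 20, 180, 100], outline=0)

# A square bounding box, so this ellipse is a circle; fill=0 paints it solid black.
draw.ellipse([(80, 120), (120, 160)], fill=0)

# Lower half of the ellipse in this box: 0 degrees is 3 o'clock, angles grow clockwise.
draw.arc([40, 110, 160, 190], start=0, end=180, fill=0)

# The centre of the circle is now black; a far corner is still white.
print(canvas.getpixel((100, 140)), canvas.getpixel((5, 5)))
```

In a notebook, `display(canvas)` would show the result. Both bounding-box notations are used here on purpose; they are interchangeable.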

These shapes can be used to highlight important parts of an image, or we can even use `ImageDraw` to add some clarifying text in the image. For now, we'll just briefly experiment with the usage of some of these draw functions. 

**Draw a face on the plankton of your choice, using at least 3 different shapes.**

*Note: These plankton images are grayscale, so you can only draw on them in shades of gray, from black to white. If you wish to draw in color, you can replace the `rotated_image.copy()` with a call to [`Image.convert(mode)`](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.convert) and convert the image to `'RGB'` first.*
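
A minimal sketch of that conversion, assuming a blank gray image from `Image.new` in place of a real plankton image:

```python
from PIL import Image, ImageDraw

# Stand-in for a grayscale plankton image: mode 'L' has a single channel per pixel.
gray = Image.new('L', (100, 100), 128)
print(gray.mode)

# convert() returns a new image, so it also serves as the copy of the original.
color = gray.convert('RGB')
print(color.mode)

# In 'RGB' mode, colors are given as (red, green, blue) tuples.
draw = ImageDraw.Draw(color)
draw.ellipse([30, 30, 70, 70], fill=(255, 0, 0))

# The circle is red, the untouched background keeps its gray value in all three channels.
print(color.getpixel((50, 50)), color.getpixel((5, 5)))
```

The original grayscale image is not modified by any of this, because all drawing happens on the converted image.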

In [None]:
# Make a copy to preserve the original image
drawing_image = rotated_image.copy()

# Create an ImageDraw that can be used to edit the image
image_canvas = ImageDraw.Draw(drawing_image)

# YOUR CODE HERE

# ImageDraw modifies the input image, so display the modified copy
display(drawing_image)