# Image Analysis - Unit 01 - Toy Datasets

## Lesson Learning Outcome

* **Image Analysis Lesson is made of 3 units**
* By the end of this lesson, you should be able to:
  * Evaluate Labels Distribution
  * Perform an Image Montage
  * Check Average Image and Image Variability
  * Check Contrast between 2 average images
  * Work with toy and real datasets
  * Understand the differences in terms of folder structure when downloading real image datasets

---

## Unit Objectives

* Use a built-in toy dataset to explore the label distribution, build an image montage, and study the average image, the image variability and the contrast between 2 average images



---

Data Science has incredible applications when dealing with images, either static, like a photo, or dynamic, like a video.


 **Why do we study Image Analysis?**
  * Because it is part of an effective EDA (Exploratory Data Analysis) on images: it covers tasks like understanding the label distribution, building an image montage, and computing the average image and image variability.



---

## Import Packages for Learning



import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("white")
print('loaded')




##  Image Analysis - Toy Dataset 


* We haven't studied TensorFlow yet, but we will use TensorFlow toy datasets in this lesson. These are datasets used for learning, meaning they will be useful for understanding the concepts for Image Analysis.
* Later in this lesson, we will use real images, where there are additional processes before analyzing the images


* For now, we will use a dataset called mnist, which is a collection of handwritten digits from 0 to 9, each stored as a 28 x 28 pixel image
* We will load the data into train and test sets


import numpy as np
from tensorflow.keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()




Let's check the train and test set size
* Notice that the shape has no channel dimension, which means the images have a single channel (greyscale). 
* If they were coloured, the shape would be (60000, 28, 28, 3) for RGB or (60000, 28, 28, 4) for RGBA

print(x_train.shape)
print(x_test.shape)
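
The number of dimensions in the shape tells you whether a channel axis is present. A minimal sketch of that reading, using synthetic zero-filled arrays rather than the real dataset; the helper name `count_channels` is hypothetical:

```python
import numpy as np

def count_channels(images):
    """Return the number of colour channels implied by an image batch shape."""
    # (n, height, width) has no channel axis, so it is single-channel (greyscale)
    if images.ndim == 3:
        return 1
    # (n, height, width, channels) stores the channel count in the last axis
    if images.ndim == 4:
        return images.shape[-1]
    raise ValueError(f"Unexpected image batch shape: {images.shape}")

grey_batch = np.zeros((5, 28, 28), dtype=np.uint8)    # same layout as the mnist arrays
rgb_batch = np.zeros((5, 28, 28, 3), dtype=np.uint8)  # what an RGB batch would look like

print(count_channels(grey_batch))  # 1
print(count_channels(rgb_batch))   # 3
```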

As we expect, the data is a NumPy array

type(x_train)

We are using the function `plt.imshow()` to display a given image. The documentation is found [here](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.imshow.html)
* We will subset the array using an index stored in the `pointer` variable. We arbitrarily choose 27
* `plt.imshow()` gets the array to be displayed, and, in this case, we set `cmap='gray'` since it is a greyscale image
* We can see the number in the image and the respective actual value in the `y_train`

pointer = 27

print(f"array pointer = {pointer}")
print(f"x_train[{pointer}] shape: {x_train[pointer].shape}")
print(f"label: {y_train[pointer]}")

plt.imshow(x_train[pointer],cmap='gray')
plt.show()


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> **PRACTICE** play around with `pointer` by setting it to other values. 
* What is the max value you can use in this case?

Also, remove `cmap='gray'` and check the difference. When you don't set this parameter, it will show the default option: `'viridis'`.


You can use `set()` to check the unique values in an array. That allows us to understand the labels present in the train set

set(y_train)
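
For NumPy arrays, `np.unique()` is a common alternative to `set()`: it returns the unique values as a sorted array. A small sketch on a made-up label array (`toy_labels` is illustrative and not part of the dataset):

```python
import numpy as np

toy_labels = np.array([3, 0, 3, 1, 0, 3])

# sorted unique values, returned as a NumPy array
unique_labels = np.unique(toy_labels)
print(unique_labels)

# both approaches find the same set of labels
same_labels = set(toy_labels.tolist()) == set(unique_labels.tolist())
print(same_labels)  # True
```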

In the cell below, assign a value to a variable named ``pointer``
* Use plt.imshow, x_train, pointer, and cmap
* Show plt

# Write your code here.


### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Labels Distribution


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We will use the convention **`label`** for the levels or classes in an image dataset.
* For example, in mnist dataset, the labels are 0, 1, 2, 3, 4, 5, 6, 7, 8, 9

We are interested in knowing if the target variable is balanced and understanding if the labels have similar frequency levels.
* We assess that with `.value_counts()` and a bar plot.
* We first convert the y_train array to a Pandas Series, then count the values, sort the index and plot with Pandas
* We notice the labels are fairly evenly distributed.

sns.set_style('whitegrid')
pd.Series(data=y_train).value_counts().sort_index().plot(kind='bar',figsize=(12,5))
plt.title("Train Set: Labels Distribution")
plt.show()
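
To put a number on "balanced", you can compare the most and the least frequent labels. A minimal sketch on synthetic labels; `toy_labels` and the use of the max/min ratio as a balance indicator are illustrative assumptions, not part of the lesson's workflow:

```python
import numpy as np

toy_labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])

# count how often each label occurs (similar in spirit to .value_counts().sort_index())
labels_found, counts = np.unique(toy_labels, return_counts=True)
proportions = counts / counts.sum()

# a ratio close to 1 suggests the labels are balanced
imbalance_ratio = counts.max() / counts.min()

for label, count, share in zip(labels_found, counts, proportions):
    print(f"label {label}: {count} images ({share:.0%})")
print(f"max/min ratio: {imbalance_ratio:.2f}")
```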

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Image Montage


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> An Image Montage aims to display a grid of images per label
* We created a custom function for this task. We will not describe its specifics; it uses knowledge covered in the course, plus functionality for creating a list of index pairs that place the images in the grid and for randomly sampling the images to be displayed

* The function arguments are: 
  * `X`: NumPy array with image data, 
  * `y`: NumPy array with the target values, 
  * `label_to_display`: the label whose images you want to show, 
  * `nrows` and `ncols` to define the grid structure and 
  * `figsize`.


Read the pseudo-code to understand the function capabilities better
  * It is reasonable if, at first, you don't understand all the code from the function below. The central point is to make sense of the pseudo-code and understand the function parameters.

import itertools
import random

def image_montage_data_as_array(X, y,label_to_display, nrows, ncols, figsize=(15,10)):
  """
   The pseudo code for the function is:
  * Subset the label you are interested in
  * If the label is not in the target array, show a montage with all labels
  * Check that the grid (nrows x ncols) has fewer spaces than there are images in the subset
  * Create list of axes indices based on nrows and ncols
  * Create a Figure and display images

  """
  sns.set_style("white")

  # subset the label you are interested in displaying
  if label_to_display in np.unique(y):
    y = y.reshape(-1,1,1)
    boolean_mask = np.any(y==label_to_display,axis=1).reshape(-1)
    df = X[boolean_mask]

  # if that label is not in the data, it shows a montage with all labels
  else:
    print("The class you selected doesn't exist.")
    print(f"The existing options are: {np.unique(y)}")
    print("Find below a montage with all labels")
    df = X

  # checks if your montage space is greater than subset size
  if nrows * ncols < df.shape[0]:
    img_idx = random.sample(range(0, df.shape[0]), nrows * ncols)
  else:
    print(
        f"Decrease nrows or ncols to create your montage. \n"
        f"There are {df.shape[0]} in your subset. "
        f"You requested a montage with {nrows * ncols} spaces")
    return
    
  # create list of axes indices based on nrows and ncols
  list_rows= range(0,nrows)
  list_cols= range(0,ncols)
  plot_idx = list(itertools.product(list_rows,list_cols))

  # create a Figure and display images
  fig, axes = plt.subplots(nrows=nrows,ncols=ncols, figsize=figsize)
  for x in range(0,nrows*ncols):
    axes[plot_idx[x][0], plot_idx[x][1]].imshow(df[img_idx[x]], cmap='gray')
  plt.tight_layout()
  plt.show()
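
The grid-index and random-sampling steps used in the function above can be looked at in isolation. A small sketch with an assumed 2 x 3 grid and a subset of 10 images; the seed is set only so the sketch is repeatable:

```python
import itertools
import random

nrows, ncols = 2, 3

# every (row, col) pair of the grid, in row-major order
plot_idx = list(itertools.product(range(nrows), range(ncols)))
print(plot_idx)  # [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]

# pick nrows * ncols distinct image positions out of a subset of 10 images
random.seed(0)
img_idx = random.sample(range(10), nrows * ncols)
print(len(img_idx), len(set(img_idx)))  # 6 6
```

Because `random.sample()` draws without replacement, no image is shown twice in the same montage.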


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's display the label 8, in a 3 x 5 grid.
* Note how differently the number 8 can be written!

image_montage_data_as_array(X=x_train, y=y_train,
              label_to_display=8,
              nrows=3, ncols=5,
              figsize=(15,10))

As an exercise, change the label value 

image_montage_data_as_array(X=x_train, y=y_train,
              label_to_display=9,
              nrows=3, ncols=5,
              figsize=(15,10))

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> **PRACTICE** We will use another built-in dataset from TensorFlow

from tensorflow.keras.datasets import fashion_mnist
(x_practice, y_practice), (x_practice_test, y_practice_test) = fashion_mnist.load_data()


Label	Description
* 0	T-shirt/top
* 1	Trouser
* 2	Pullover
* 3	Dress
* 4	Coat
* 5	Sandal
* 6	Shirt
* 7	Sneaker
* 8	Bag
* 9	Ankle boot

Use your existing knowledge to assess the label distribution

# write the code here to assess label distribution


In the following cell, call the ``image_montage_data_as_array`` custom function to make an image montage
* choose a label from 0-9 

# Write your code here.


### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Average Image and Image Variability per Label

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We noticed that for each label, the images would be slightly different from each other, but in general, we expect them to have a pattern

An Average Image and Image Variability per label helps to study these patterns
* An average image is what you get when you subset all images (NumPy arrays) from a given label and calculate the pixel-wise mean across those images
* Image Variability is what you get when you subset all images (NumPy arrays) from a given label and calculate the pixel-wise standard deviation across those images
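
A tiny numeric sketch of both definitions, assuming three made-up 2 x 2 "images" that belong to one label (`toy_images` is illustrative, not taken from the dataset):

```python
import numpy as np

# three 2 x 2 "images" that all belong to one label
toy_images = np.array([
    [[0, 255], [0, 0]],
    [[0, 255], [255, 0]],
    [[0, 255], [0, 0]],
], dtype=float)

avg_img = np.mean(toy_images, axis=0)  # pixel-wise mean across the 3 images
std_img = np.std(toy_images, axis=0)   # pixel-wise standard deviation across the 3 images

print(avg_img)
print(std_img)
```

The top-right pixel is 255 in every image, so its average is 255 and its variability is 0. The bottom-left pixel is lit in only one of the three images, so its average is 85 and its variability is greater than 0.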


Read the pseudo-code to understand the function capabilities better
  * It is reasonable if, at first, you don't understand all the code from the function below. The central point is to make sense of the pseudo-code and understand the function parameters.


def image_avg_and_variability_data_as_array(X, y, figsize=(12,5)):
  """
   The pseudo-code for the function is:
  * Loop through all labels
  * Subset an array for a given label
  * Calculate the average and standard deviation
  * Create a Figure displaying the average and variability image

  """
  sns.set_style("white")

  for label_to_display in np.unique(y):

    y = y.reshape(-1,1,1)
    boolean_mask = np.any(y==label_to_display,axis=1).reshape(-1)
    arr = X[boolean_mask]

    avg_img = np.mean(arr, axis = 0)
    std_img = np.std(arr, axis = 0)
    print(f"==== Label {label_to_display} ====")
    print(f"Image Shape: {avg_img.shape}")
    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=figsize)
    axes[0].set_title(f"Average Image for label {label_to_display}")
    axes[0].imshow(avg_img, cmap='gray')
    axes[1].set_title(f"Image Variability for label {label_to_display}")
    axes[1].imshow(std_img, cmap='gray')
    plt.show()
    print("\n")
  



<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> To help with the interpretation, consider the following guide:
* Check for the patterns where the colour is darker or lighter
* For **Average Image**, we notice the general patterns for a given label
* For **Image Variability**, the lighter area indicates higher variability in that area. For example, for zero, we see the middle is black (meaning all zeros tend not to have the middle filled), and a circled area is white (meaning the images tend to vary in this circled area)
* You will notice that the plots complement each other since both, from different angles, show the image patterns



image_avg_and_variability_data_as_array(X=x_train, y=y_train, figsize=(12,5))

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Note: There will be datasets where the images in a given label will have distinct shapes or patterns, and an average and variability study may not give the same amount of insights as we see in the mnist dataset

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> For example, your dataset may contain images of fish and birds from multiple species. 
* Eventually, when you subset fishes and calculate an average image, the result will be a combination of patterns from multiple fish species that may confuse a user unfamiliar with the context.

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Contrast between 2 Labels


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Here we are interested in evaluating the contrast between the two labels.
* Which may provide additional insight into how the labels differ from each other

* We created a custom function, `contrast_between_2_labels_data_as_array()`, that computes that. The arguments are `X`, the image data as a NumPy array; `y`, the array indicating the label; `label_1` and `label_2`, the labels you are interested in comparing; and `figsize`, to set the figure size

  * It is reasonable if, at first, you don't understand all the code from the function below. The central point is to make sense of the pseudo-code and understand the function parameters.

def subset_image_label(X,y,label_to_display):
  y = y.reshape(-1,1,1)
  boolean_mask = np.any(y==label_to_display,axis=1).reshape(-1)
  df = X[boolean_mask]
  return df

def contrast_between_2_labels_data_as_array(X, y, label_1, label_2, figsize=(12,5)):
  sns.set_style("white")

  if (label_1 not in np.unique(y)) or (label_2 not in np.unique(y)):
    print(f"Either label {label_1} or label {label_2} is not in {np.unique(y)} ")
    return

  # calculate mean from label1
  images_label = subset_image_label(X, y, label_1)
  label1_avg = np.mean(images_label, axis = 0)

  # calculate mean from label2
  images_label = subset_image_label(X, y, label_2)
  label2_avg = np.mean(images_label, axis = 0)

  # calculate difference and plot difference, avg label1 and avg label2
  contrast_mean = label1_avg - label2_avg
  fig, axes = plt.subplots(nrows=1, ncols=3, figsize=figsize)
  axes[0].imshow(contrast_mean, cmap='gray')
  axes[0].set_title(f'Difference Between Averages: {label_1} & {label_2}')
  axes[1].imshow(label1_avg, cmap='gray')
  axes[1].set_title(f'Average {label_1}')
  axes[2].imshow(label2_avg, cmap='gray')
  axes[2].set_title(f'Average {label_2}')
  plt.show()


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> To help the interpretation, consider the following guide:
* You are comparing label_1 to label_2
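* The Difference Between Averages plot shows the `label_1` average minus the `label_2` average. Areas with an intermediate grey tone are where both average images are similar. Lighter areas are where the `label_1` average is brighter, and darker areas are where the `label_2` average is brighter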
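
Because the function subtracts one average image from the other, the sign of each pixel in the difference carries information. A minimal sketch with two made-up 2 x 2 average images; taking the absolute value at the end is an optional variation for when only the size of the difference matters, not something the lesson's function does:

```python
import numpy as np

# made-up average images for two labels, with pixel values between 0 and 1
label1_avg = np.array([[1.0, 0.5], [0.0, 0.2]])
label2_avg = np.array([[0.0, 0.5], [1.0, 0.2]])

# positive where label 1 is brighter, negative where label 2 is brighter, 0 where they match
contrast_mean = label1_avg - label2_avg
print(contrast_mean)

# optional: ignore the direction and keep only the magnitude of the difference
abs_contrast = np.abs(contrast_mean)
print(abs_contrast)
```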

contrast_between_2_labels_data_as_array(X=x_train, y=y_train,
                                        label_1=8, label_2=2,
                                        figsize=(12,10)
                                        )

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The same note from the previous section applies here:
* There will be datasets where the images in a given label will have distinct shapes or patterns, and a study of the contrast between averages may not provide the same amount of insight as we see in the mnist dataset

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> **PRACTICE** Use your existing knowledge to assess, for the x_practice and y_practice data:
* the average image and image variability per label, 
* and the contrast between labels



# write the code here to assess average image, image variability


# write the code here for the contrast between 2 labels 
# We suggest you try a few pairs of labels so that you can get comfortable with the data 

# Image Analysis - Unit 02 - Real Datasets - Part 01

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Use a real dataset to explore the label distribution, build an image montage, and study the average image, the image variability and the contrast between 2 average images



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Packages for Learning

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("white")

# Let's extract the image files we will be using
from zipfile import ZipFile

with ZipFile('Chess.zip', 'r') as chessZip:
   chessZip.extractall()

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Image Analysis - Real Datasets - Part 01

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Real image datasets are different from the toy datasets we find in the ML libraries
* One major difference is the fact that real images will likely have different sizes; for example, you can't expect all images to be in the 100 x 50 pixels format.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> A common arrangement is that the downloaded images come in one root folder. That folder contains a set of sub-folders, one per image label, and each sub-folder holds the image files
* Ultimately we want to have three folders: Train, Validation and Test. However, your dataset may come with: 
  * One folder (with subfolders as the labels)
  * Two folders (like Train and Test)
  * Or three folders (with Train, Validation and Test)

* Let's imagine our dataset is called Animals_Image, and there are three labels: Dog, Cat and Parrot. The data could come in any of the three formats listed above
  * In every case, each label sub-folder holds a whole set of images for that label, not just one or two
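
To make the one-folder format concrete, here is a hedged sketch that builds it in a temporary directory with empty placeholder files. The file names are hypothetical, and the two-folder and three-folder formats would only add one extra level (for example `Animals_Image/Train/Dog/...`):

```python
import os
import tempfile
from pathlib import Path

labels = ["Cat", "Dog", "Parrot"]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "Animals_Image"

    # one-folder layout: Animals_Image/<label>/<image files>
    for label in labels:
        label_dir = root / label
        label_dir.mkdir(parents=True)
        for i in range(3):
            # empty placeholder files standing in for real images
            (label_dir / f"{label.lower()}_{i}.jpg").touch()

    # the sub-folder names are the labels; each one holds that label's images
    found_labels = sorted(os.listdir(root))
    files_per_label = {label: len(os.listdir(root / label)) for label in found_labels}

print(found_labels)     # ['Cat', 'Dog', 'Parrot']
print(files_per_label)  # {'Cat': 3, 'Dog': 3, 'Parrot': 3}
```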


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  We want our data to be in a 3-folder structure. We will need to move files between folders and do that programmatically. We will cover how to do that in the next unit.

* In this unit, we will use a dataset that comes with one folder only.


### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Image Analysis

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We will use the following workflow to start our image analysis study in this unit:
* 1 - Set data directory
* 2 - Delete non-image files
* 3 - Assess Labels Distribution
* 4 - Build an Image Montage
* 5 - Calculate Average Image and Image Variability
* 6 - Contrast Between 2 Labels
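
One way step 2 could look is sketched here, under the assumption that image files can be recognised by their extension. The function name `remove_non_image_files`, the extension list and the `Queen` folder are all hypothetical, and the demo runs on a throwaway directory rather than on the real dataset:

```python
import os
import tempfile

IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg')  # assumed set of valid extensions

def remove_non_image_files(data_dir):
    """Delete files in each label sub-folder whose extension is not in IMAGE_EXTENSIONS."""
    removed = []
    for label in os.listdir(data_dir):
        label_path = os.path.join(data_dir, label)
        if not os.path.isdir(label_path):
            continue  # skip stray files at the root level
        for filename in os.listdir(label_path):
            if not filename.lower().endswith(IMAGE_EXTENSIONS):
                os.remove(os.path.join(label_path, filename))
                removed.append(filename)
    return removed

# demonstrate on a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'Queen'))
    for name in ('a.jpg', 'b.PNG', 'notes.txt'):
        open(os.path.join(tmp, 'Queen', name), 'w').close()
    removed = remove_non_image_files(tmp)
    remaining = sorted(os.listdir(os.path.join(tmp, 'Queen')))

print(removed)    # ['notes.txt']
print(remaining)  # ['a.jpg', 'b.PNG']
```

Checking the extension does not prove a file is a readable image, so a corrupted `.jpg` would still pass this filter.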

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Set Data Directory

First, locate your data directory
* That is the root path of your data. In this case, it is the folder named after the dataset: Chess

my_data_dir = 'Chess'
my_data_dir

The labels are taken from the folder names inside the Chess folder. This is done with `os.listdir()`, where the argument is `'Chess'`. The documentation is found [here](https://docs.python.org/3/library/os.html#os.listdir)

import os
labels = os.listdir(my_data_dir)
labels
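
`os.listdir()` returns every entry in the directory, in arbitrary order, so a stray file at the root level would be picked up as a label. A hedged sketch of a helper that keeps only sub-folders and sorts them; `list_label_folders` and the folder names are hypothetical, and the demo uses a temporary directory:

```python
import os
import tempfile

def list_label_folders(data_dir):
    """Return the sorted names of the sub-folders of data_dir (one per label)."""
    return sorted(
        entry for entry in os.listdir(data_dir)
        if os.path.isdir(os.path.join(data_dir, entry))
    )

with tempfile.TemporaryDirectory() as tmp:
    for label in ('Rook', 'Bishop'):
        os.makedirs(os.path.join(tmp, label))
    open(os.path.join(tmp, 'README.txt'), 'w').close()  # stray file that is not a label
    demo_labels = list_label_folders(tmp)

print(demo_labels)  # ['Bishop', 'Rook']
```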

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Labels Distribution

We create custom code that:
* Stores in a DataFrame: the name of the set (in this case, Chess), the label and its frequency
* We plot the DataFrame in a barplot showing the frequencies.

rows = []
for folder in ['Chess']:   
                  # think of 'Chess' as a Set Folder. 
                  # Ideally we want a Train Set, Val Set and Test Set
  for label in labels:
    # DataFrame.append() was removed in pandas 2.0, so we collect the rows in a list first
    rows.append({'Set': folder,
                 'Label': label,
                 'Frequency': len(os.listdir(folder + '/' + label))})

# build the DataFrame once, from the collected rows
df_freq = pd.DataFrame(rows)
df_freq

We plot the DataFrame using a barplot, where x is the `Set` (in this case, only Chess), y is the `Frequency`, and hue is `Label`
* We notice the frequencies are not the same across all labels. Some labels have far fewer images than others 


print("\n")
sns.set_style("whitegrid")
plt.figure(figsize=(8,5))
sns.barplot(data=df_freq, x='Set', y='Frequency', hue='Label')
plt.show()

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Image Montage

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Similarly to the previous unit, we want to do an Image Montage on the labels to start understanding the dataset
* The difference is that the dataset is located in folders, not in an array. One key difference in the logic of this function is that it loops through the image file names in the folder, then loads and plots each image in a Figure
* Check the pseudo-code to understand the logic
  * It is normal and okay if you don't understand all the code in the function below at first. The central point is making sense of the pseudo-code and understanding the function parameters

import itertools
import random
from matplotlib.image import imread


def image_montage(dir_path, label_to_display, nrows, ncols, figsize=(15,10)):
  """
  logic
  - if a label exists in the folder
  - check that your montage space is smaller than the subset size
  - create a list of axes indices based on nrows and ncols
  - create a Figure and display images
  - in this loop, load the image and plot the given image

  """
  sns.set_style("white")

  labels = os.listdir(dir_path)

  # subset the class you are interested in displaying
  if label_to_display in labels:

    # checks if your montage space is greater than subset size
    images_list = os.listdir(dir_path+'/'+ label_to_display)
    if nrows * ncols < len(images_list):
      img_idx = random.sample(images_list, nrows * ncols)
    else:
      print(
          f"Decrease nrows or ncols to create your montage. \n"
          f"There are {len(images_list)} in your subset. "
          f"You requested a montage with {nrows * ncols} spaces")
      return
    

    # create a list of axes indices based on nrows and ncols
    list_rows= range(0,nrows)
    list_cols= range(0,ncols)
    plot_idx = list(itertools.product(list_rows,list_cols))


    # create a Figure and display images
    fig, axes = plt.subplots(nrows=nrows,ncols=ncols, figsize=figsize)
    for x in range(0,nrows*ncols):
      img = imread(dir_path + '/' + label_to_display + '/' + img_idx[x], 0)
      img_shape = img.shape
      axes[plot_idx[x][0], plot_idx[x][1]].imshow(img)
      axes[plot_idx[x][0], plot_idx[x][1]].set_title(f"Width {img_shape[1]}px x Height {img_shape[0]}px")
      axes[plot_idx[x][0], plot_idx[x][1]].set_xticks([])
      axes[plot_idx[x][0], plot_idx[x][1]].set_yticks([])
    plt.tight_layout()
    plt.show()


  else:
    print("The label you selected doesn't exist.")
    print(f"The existing options are: {labels}")

We create the logic where we loop over the labels, and for each, we do an image montage
* Note also that the dimensions of the images are different

for label in labels:
  print(label)
  image_montage(dir_path= my_data_dir,
                label_to_display= label,
                nrows=2, ncols=3,
                figsize=(10,15)
                )
  print("\n")

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Average Image and Image Variability per Label

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  All images must be the same size to compute an average image. First, we need to determine what will be the average image size so that we can load all images in a uniform array
* We loop over the label folders in `my_data_dir`, load each image and store the height and width in `dim1` and `dim2`. Afterwards, we plot the image sizes in a scatterplot and mark the average width and height

dim1, dim2 = [], []
for label in labels:
  for image_filename in os.listdir(my_data_dir + '/'+ label):
    img = imread(my_data_dir + '/' + label + '/'+ image_filename, 0)
    img_shape = img.shape
    dim1.append(img_shape[0]) # image height
    dim2.append(img_shape[1]) # image width

sns.set_style("whitegrid")
fig, axes = plt.subplots()
sns.scatterplot(x=dim2, y=dim1, alpha=0.2)
axes.set_xlabel("Width (pixels)")
axes.set_ylabel("Height (pixels)")
dim1_mean = int(np.array(dim1).mean())
dim2_mean = int(np.array(dim2).mean())
axes.axvline(x=dim2_mean,color='r', linestyle='--')
axes.axhline(y=dim1_mean,color='r', linestyle='--')
plt.show()
print(f"Width average: {dim2_mean} \nHeight average: {dim1_mean}")

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We need to load all images into a uniform array.
* We create a custom function that loops over a directory. In this directory, each sub-folder is a label. For each label (sub-folder), we load each image, resize it to the average height and width we computed earlier and store it in an array. 
* In the end, we have X and y arrays, where X stores the image pixels and y stores the label for each image.
  * It is normal and okay if you don't understand all the code in the function below at first. The central point is to make sense of the function parameters.

sns.set_style("white")
from tensorflow.keras.preprocessing import image

def load_image_as_array(my_data_dir, new_size=(50,50), images_amount = 20):
  
  X, y = np.array([], dtype='int'), np.array([], dtype='object')
  labels = os.listdir(my_data_dir)

  for label in labels:
    counter = 0
    for image_filename in os.listdir(my_data_dir + '/' + label):
      if counter < images_amount:
        
        img = image.load_img(my_data_dir + '/' + label + '/' + image_filename, target_size=new_size)
        if image.img_to_array(img).max() > 1: 
          img_resized = image.img_to_array(img) / 255
        else: 
          img_resized = image.img_to_array(img)
        
        X = np.append(X, img_resized).reshape(-1, new_size[0], new_size[1], img_resized.shape[2])
        y = np.append(y, label)
        counter += 1

  return X, y


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The function parameters are:
* `my_data_dir`, the root folder of the dataset (here, `Chess`),
* `new_size`, which is the average image dimension from this dataset, and 
* `images_amount`, which is the number of images per label you want to load. Loading, resizing and storing image data has a considerable computing cost, so here we load the same number of images per label and set a value that will not take long to load. You may also run into memory issues when loading many images, depending on how much memory is available
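
Part of that computing cost comes from `np.append()`, which returns a new copy of the whole array on every call. A common alternative, sketched here with synthetic arrays standing in for the loaded images, is to collect the images in a Python list and stack them once at the end. The helper name `stack_images` and the labels are hypothetical:

```python
import numpy as np

def stack_images(image_arrays):
    """Stack equally sized (height, width, channels) arrays into one (n, height, width, channels) array."""
    return np.stack(image_arrays, axis=0)

# synthetic stand-ins for images that were already resized and rescaled to the 0-1 range
rng = np.random.default_rng(0)
collected = [rng.random((50, 40, 3)) for _ in range(4)]
collected_labels = ['King', 'King', 'Knight', 'Knight']  # hypothetical labels

X_demo = stack_images(collected)
y_demo = np.array(collected_labels)

print(X_demo.shape)  # (4, 50, 40, 3)
print(y_demo.shape)  # (4,)
```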


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> It may take 2 or 3 minutes to load all images

X, y = load_image_as_array(my_data_dir=my_data_dir,
                           new_size=(dim1_mean,dim2_mean),
                           images_amount = 2)

We will use `image_avg_and_variability_data_as_array()` to understand the average image and image variability for this dataset. This is the same function we used in the last unit

def image_avg_and_variability_data_as_array(X, y, figsize=(12,5)):
  """
   The pseudo-code for the function is:
  * Loop through all labels
  * Subset an array for a given label
  * Calculate the average and standard deviation
  * Create a Figure displaying the average and variability image

  """
  sns.set_style("white")

  for label_to_display in np.unique(y):

    y = y.reshape(-1,1,1)
    boolean_mask = np.any(y==label_to_display,axis=1).reshape(-1)
    arr = X[boolean_mask]

    avg_img = np.mean(arr, axis = 0)
    std_img = np.std(arr, axis = 0)
    print(f"==== Label {label_to_display} ====")
    print(f"Image Shape: {avg_img.shape}")
    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=figsize)
    axes[0].set_title(f"Average Image for label {label_to_display}")
    axes[0].imshow(avg_img, cmap='gray')
    axes[1].set_title(f"Image Variability for label {label_to_display}")
    axes[1].imshow(std_img, cmap='gray')
    plt.show()
    print("\n")
  

We pass X and y to the function to understand the average image and image variability per label
* You will likely notice typical patterns/shapes for pieces like king or knight; however, the images (average and variability) may be too blurred. 
* This happens because we didn't load many images, since we just wanted to show the use case. If you want, go back to `load_image_as_array()`, set a higher value for `images_amount`, and rerun `image_avg_and_variability_data_as_array()`

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> To help the interpretation, consider the following guide:
* Check for the patterns where the colour is darker or lighter
* For **Average Image**, we notice the general patterns for a given label
* For **Image Variability**, the lighter area indicates higher variability across images from the same label in that area. 

image_avg_and_variability_data_as_array(X=X, y=y, figsize=(12,5))

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Note: There will be datasets where the images in a given label have distinct shapes or patterns, and an average and variability study may not give the same amount of insight as we saw in the mnist dataset

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> For example, your dataset may contain images of fish and birds from multiple species. 
* When you subset the fish images and calculate an average image, the result will be a blend of patterns from multiple fish species, which may confuse a user unfamiliar with the context.

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Contrast between 2 Labels

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We may be at a point in our project where we want to know the differences between 2 classes
* We will use the same function from the last unit, `contrast_between_2_labels_data_as_array()`

def subset_image_label(X,y,label_to_display):
  y = y.reshape(-1,1,1)
  boolean_mask = np.any(y==label_to_display,axis=1).reshape(-1)
  df = X[boolean_mask]
  return df

def contrast_between_2_labels_data_as_array(X, y, label_1, label_2, figsize=(12,5)):
  sns.set_style("white")

  if (label_1 not in np.unique(y)) or (label_2 not in np.unique(y)):
    print(f"Either label {label_1} or label {label_2} is not in {np.unique(y)}")
    return

  # calculate the mean from label1
  images_label = subset_image_label(X, y, label_1)
  label1_avg = np.mean(images_label, axis = 0)

  # calculate the mean from label2
  images_label = subset_image_label(X, y, label_2)
  label2_avg = np.mean(images_label, axis = 0)

  # calculate the difference and plot the difference, avg label1 and avg label2
  contrast_mean = label1_avg - label2_avg
  fig, axes = plt.subplots(nrows=1, ncols=3, figsize=figsize)
  axes[0].imshow(contrast_mean, cmap='gray')
  axes[0].set_title(f'Difference Between Averages: {label_1} & {label_2}')
  axes[1].imshow(label1_avg, cmap='gray')
  axes[1].set_title(f'Average {label_1}')
  axes[2].imshow(label2_avg, cmap='gray')
  axes[2].set_title(f'Average {label_2}')
  plt.show()

Let's compare King and Knight

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> To help the interpretation, consider the following guide:
* You are comparing label_1 to label_2
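* The Difference Between Averages plot shows the signed difference (average of label_1 minus average of label_2). Intermediate grey areas are where both average images are similar; lighter areas are where label_1 is brighter on average, and darker areas are where label_2 is brighter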
* In this dataset, the contrast may not provide much insight since there is a small number of images per label.
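
As a rough illustration of how the difference plot gets its grey levels, here is a minimal sketch with two hypothetical 2x2 average images (the values are invented). It assumes matplotlib's default behaviour for `imshow`, which maps the smallest value in the array to black and the largest to white when `vmin` and `vmax` are not set. The absolute difference at the end is an optional variant, not what the function above plots.

```python
import numpy as np

# Hypothetical average images for two labels.
label1_avg = np.array([[200.0, 50.0], [120.0, 120.0]])
label2_avg = np.array([[50.0, 200.0], [120.0, 120.0]])

# Signed difference, as computed in contrast_between_2_labels_data_as_array()
contrast = label1_avg - label2_avg

# Mimic imshow's default scaling: minimum -> 0 (black), maximum -> 1 (white)
scaled = (contrast - contrast.min()) / (contrast.max() - contrast.min())

# Optional variant: the absolute difference ignores which label is brighter,
# so dark means "similar" and light means "different"
abs_contrast = np.abs(contrast)

print(contrast)
print(scaled)
print(abs_contrast)
```

In the scaled array, pixels where both averages agree land at 0.5 (mid-grey), while the pixels where one label is brighter land at the extremes.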

contrast_between_2_labels_data_as_array(X=X, y=y,
                          label_1='King',
                          label_2='Knight',
                          figsize=(15,20)
                          )

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The same note from the previous section applies here:
* There will be datasets where the images in a given label have distinct shapes or patterns, and a contrast between averages study may not provide the same amount of insight as we saw in the mnist dataset



# Image Analysis - Unit 03 - Real Datasets - Part 02

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Understand the differences for image datasets folder structure, and split folders into train, validation and test sets



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Packages for Learning

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("white")

from zipfile import ZipFile

# Let's extract the data we will be using
with ZipFile('Chess.zip', 'r') as chessZip:
   chessZip.extractall()
with ZipFile('Covid19-dataset.zip', 'r') as covidZip:
   covidZip.extractall()

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Image Analysis - Real Datasets - Part 02

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> In the last unit, we used the Chess dataset, where a single folder hosted a set of sub-folders. Each sub-folder is related to a label. In each subfolder, you found a set of images

* For the ML image classification task, we are interested in having a standardised way to arrange the files in a folder. Before doing any Image Analysis and ML on images, we have to organise the dataset folders.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> As a recap, our images could be arranged in these 3 formats
  * One folder (with subfolders as the labels)
  * Two folders (like Train and Test)
  * Or three folders (with Train, Validation and Test)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> Ultimately we want to have **three folders: Train, Validation and Test**.


The image below shows three potential folder structures. The first on the left shows one folder (with subfolders as the labels), the central piece shows two folders (like Train and Test set folders), and the last on the right shows three folders (with Train, Validation and Test)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png">  Again, we are interested in having a 3-folder structure. We will need to move files between folders programmatically. 
* We need three folders because fitting a model requires three data sets: train, validation and test, as we have previously studied

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's split into two possibilities:
* The dataset comes with one folder
* The dataset comes with two folders (either Train/Test or Train/Validation). We will stick with the Train/Test convention

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We are going to use a workflow:
* 1 - Delete non-image files
* 2 - Split Train, Validation, Test Set


---

### The dataset comes with one folder

We will use the same [Chess dataset](https://www.kaggle.com/niteshfre/chessman-image-dataset) to demonstrate how to arrange it into **three folders: Train, Validation and Test**


---


#### Split Train, Validation, Test Set

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We assume there is one folder that holds a set of folders that represent the label. In each sub-folder, we find the images related to each label.
* Read the pseudo-code function to understand the function objective
  * It is normal and okay if, at first, you don't understand all the code from the function below. The central point is to make sense of the pseudo-code and understand the function parameters.

import os
import shutil
import random
import joblib


def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
    """
    logic
    - There is one folder that holds a set of folders that represent the label. 
    In each sub-folder, we find the images related to each label
    - you provide the ratio for train, validation and test set. They should sum to 1.0
    - it will generate three folders (train, validation and test). In each folder, 
    there will be a set of subfolders related to each label. The proportion of a given 
    label across folders (train, validation, test) is set with the train_set_ratio, 
    validation_set_ratio and test_set_ratio parameters
    """

    # compare with a small tolerance, since a sum of floats can differ slightly from 1.0
    if abs(train_set_ratio + validation_set_ratio + test_set_ratio - 1.0) > 1e-9:
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    # gets labels
    labels = os.listdir(my_data_dir)  # it should get only the folder name
    if 'test' in labels:
        pass
    else:
        # hack: sometimes, in a jupyter notebook session, a temporary/invisible
        # folder called .ipynb_checkpoints appears; we don't want it, so we remove it from the labels list
        labels = [item for item in labels if '.ipynb_checkpoints' not in item]

        # create the train, validation, and test folders with labels sub-folder
        for folder in ['train', 'validation', 'test']:
            for label in labels:
                os.makedirs(name=my_data_dir + '/' + folder + '/' + label)

        for label in labels:

            files = os.listdir(my_data_dir + '/' + label)
            random.shuffle(files)

            train_set_files_qty = int(len(files) * train_set_ratio)
            validation_set_files_qty = int(len(files) * validation_set_ratio)

            count = 1
            for file_name in files:
                if count <= train_set_files_qty:
                    # move the given file to a train set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/train/' + label + '/' + file_name)

                elif count <= (train_set_files_qty + validation_set_files_qty):
                    # move the given file to a validation set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/validation/' + label + '/' + file_name)

                else:
                    # move the given file to the test set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/test/' + label + '/' + file_name)

                count += 1
            os.rmdir(my_data_dir + '/' + label)


Let's display the folders within the Chess folder.

!ls Chess

Within the Chess folder, we see three further folders, one for each dataset label. Let's display the content of the folder named Bishop

* We now see the images inside the Bishop folder; notice we have jpg, gif, and png file extensions

!ls Chess/Bishop

We will split allocating 65% to the train set, 15% to validation, and 20% to the test set
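
Because the function truncates the file counts with `int()`, the actual proportions per label can differ slightly from the requested ratios. Here is a small worked example, assuming a hypothetical label folder with 87 images (the real counts in the Chess folders will differ):

```python
n_files = 87  # hypothetical number of images in one label folder

train_qty = int(n_files * 0.65)        # 56.55 is truncated to 56
validation_qty = int(n_files * 0.15)   # 13.05 is truncated to 13
# the function sends every remaining file to the test set
test_qty = n_files - train_qty - validation_qty

print(train_qty, validation_qty, test_qty)
print(round(train_qty / n_files, 3),
      round(validation_qty / n_files, 3),
      round(test_qty / n_files, 3))
```

The test set absorbs the files lost to truncation, so it ends up slightly above 20% in this example.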

split_train_validation_test_images(my_data_dir='Chess',
                                   train_set_ratio=0.65,
                                   validation_set_ratio=0.15,
                                   test_set_ratio=0.2)

You can see how the folder structure changed, and now within the Chess folder, we have the folders test, train, validation

!ls Chess

If you look inside the train folder, you will see folders for Bishop, King, and Knight

!ls Chess/train

### The dataset comes with two folders

We will use the [covid19 image dataset](https://www.kaggle.com/pranavraikokte/covid19-image-dataset), which is made of chest X-rays already arranged into Train and Test sets
* We will demonstrate how to arrange into three folders: Train, Validation and Test

---

#### Split Train, Validation, Test Set

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The folders are split into Train and Test sets already. 
* We want to get part of the data from the train and assign it to a validation folder
  * It is normal and okay if, at first, you don't understand all the code from the function below. The central point is to make sense of the pseudo-code and understand the function parameters.

!ls Covid19-dataset

import os
import shutil
import random
import joblib


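def split_validation_from_train_set(my_data_dir, train_set_folder_name, train_set_ratio):
    """
    logic
    - The dataset folder already holds a train folder, with one sub-folder per label
    - you provide train_set_ratio, the fraction of train images that stays in the train set
    - it creates a validation folder with the same label sub-folders and moves the
    remaining fraction (1 - train_set_ratio) of each label's images into it
    - if a validation folder already exists, nothing is moved
    """
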
    if train_set_ratio >= 1.0 or train_set_ratio <= 0:
        print("train_set_ratio should be positive and smaller than 1.0")
        return

    # define the train set dir
    train_set_dir = my_data_dir + '/' + train_set_folder_name
    directory_list = os.listdir(my_data_dir)

    # gets the labels
    labels = os.listdir(train_set_dir)
    
    if 'validation' in directory_list:
        pass
    else:

        # hack: sometimes, in a jupyter notebook session, a temporary/invisible
        # folder called .ipynb_checkpoints appears; we don't want it, so we remove it from the labels list
        labels = [item for item in labels if '.ipynb_checkpoints' not in item]

        for label in labels:  # create a validation folder
            os.makedirs(name=my_data_dir + '/validation/' + label)

        for label in labels:
            files = os.listdir(train_set_dir + '/' + label)
            random.shuffle(files)
            # number of files that leave the train set and go to the validation set
            validation_set_files_qty = int(len(files) * (1-train_set_ratio))

            count = 1
            for file_name in files:
                if count <= validation_set_files_qty:
                    # move the given file to a validation set
                    shutil.move(train_set_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/validation/' + label + '/' + file_name)

                count += 1


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We provide the data directory, the train_set_folder_name and a train_set_ratio of 0.8, which means 80% of the data from the train set will remain in the train set, and the remaining 20% will go to the validation set

split_validation_from_train_set(my_data_dir= 'Covid19-dataset',
                                train_set_folder_name = 'train',
                                train_set_ratio=0.8)

If you check the folder structure of the Covid19-dataset, you will see it now contains folders for test, train, and validation

!ls Covid19-dataset