## **Instructions**

This project, Cloud Classification, is part of the CASA0018 Deep Learning project. The datasets can be downloaded from the links below. In addition, some images were taken by me or come from public image repositories such as Flickr. Follow the steps in this tutorial to reproduce the results. Your results may differ, since the datasets will not be identical.


1.   Download Cirrus Cumulus Stratus Nimbus (CCSN) Database
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/CADDPD&version=2.0

2.   Download Howard-Cloud-X
https://www.kaggle.com/datasets/imbikramsaha/howard-cloudx/code

### **Data Cleaning**
#### Manually checking the images

While the CCSN dataset has 10 cloud types plus 1 contrail class, Howard-Cloud-X has only the 10 cloud types, without contrails.

First, let's see how many pictures each dataset has for each cloud type.

In [2]:
import cv2 #opencv
import os
import time
import uuid

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
CCSN_path = '/content/drive/MyDrive/Kuliah/UCL/Projects/CASA0018-Cloud-Classification/Projects/Final_Project/Datasets/CCSN_v2' #Directories of Google Drive

os.chdir(CCSN_path) #Go to path
os.listdir(CCSN_path) #List out files in the path
print("Files and directories in '", CCSN_path, "' :", os.listdir(CCSN_path))

Files and directories in ' /content/drive/MyDrive/Kuliah/UCL/Projects/CASA0018-Cloud-Classification/Projects/Final_Project/Datasets/CCSN_v2 ' : ['Sc', 'Ns', 'Ci', 'Cu', 'Cs', 'Ct', 'St', 'As', 'Cc', 'Ac', 'Cb']


In [5]:
Howard_path = '/content/drive/MyDrive/Kuliah/UCL/Projects/CASA0018-Cloud-Classification/Projects/Final_Project/Datasets/Howard-Cloud-X' #Directories of Google Drive
os.chdir(Howard_path) #Go to path
os.listdir(Howard_path) #List out files in the path
print("Files and directories in '", Howard_path, "' :", os.listdir(Howard_path))

Files and directories in ' /content/drive/MyDrive/Kuliah/UCL/Projects/CASA0018-Cloud-Classification/Projects/Final_Project/Datasets/Howard-Cloud-X ' : ['Ac', 'As', 'Cc', 'Cs', 'Ci', 'Cb', 'Cu', 'Ns', 'Sc', 'St']


In [76]:
CCSN_dir = os.listdir(CCSN_path)
for dir_name in CCSN_dir:
    # Create the full path to the directory
    dir_path = os.path.join(CCSN_path, dir_name)

    # Get a list of files in the directory
    files = os.listdir(dir_path)

    # Count the number of files
    num_files = len(files)

    # Print the directory name and the number of files it contains
    print(f"{dir_name}: {num_files} files")

Sc: 208 files
Ns: 162 files
Ci: 74 files
Cu: 109 files
Cs: 58 files
Ct: 155 files
St: 257 files
As: 65 files
Cc: 105 files
Ac: 118 files
Cb: 193 files


In [77]:
Howard_dir = os.listdir(Howard_path)
for dir_name in Howard_dir:
    # Create the full path to the directory
    dir_path = os.path.join(Howard_path, dir_name)

    # Get a list of files in the directory
    files = os.listdir(dir_path)

    # Count the number of files
    num_files = len(files)

    # Print the directory name and the number of files it contains
    print(f"{dir_name}: {num_files} files")

Ac: 136 files
As: 187 files
Cc: 131 files
Cs: 129 files
Ci: 120 files
Cb: 124 files
Cu: 188 files
Ns: 131 files
Sc: 132 files
St: 137 files
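
The same counting loop is reused for every dataset in this notebook, so it could be wrapped in a small helper. This is only a sketch: `count_files_per_class` and `print_counts` are hypothetical names, not part of the original notebook. Unlike the loop above, it skips hidden entries (such as the `.ipynb_checkpoints` folder that Colab sometimes creates) and anything that is not a directory.

```python
import os

def count_files_per_class(dataset_path):
    """Return {class_folder: number_of_files} for one dataset directory.

    Hypothetical helper (not from the original notebook). Hidden entries
    and loose files at the top level of the dataset are ignored.
    """
    counts = {}
    for dir_name in sorted(os.listdir(dataset_path)):
        dir_path = os.path.join(dataset_path, dir_name)
        # Skip hidden folders and anything that is not a class folder
        if dir_name.startswith(".") or not os.path.isdir(dir_path):
            continue
        counts[dir_name] = sum(
            1
            for file_name in os.listdir(dir_path)
            if not file_name.startswith(".")
            and os.path.isfile(os.path.join(dir_path, file_name))
        )
    return counts

def print_counts(dataset_path):
    """Print the counts in the same 'name: N files' format used above."""
    for dir_name, num_files in count_files_per_class(dataset_path).items():
        print(f"{dir_name}: {num_files} files")
```

With this in place, `print_counts(CCSN_path)` and `print_counts(Howard_path)` would replace the two loops. The folders come back in alphabetical order rather than in `os.listdir` order.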


#### Combine files from both datasets

In [78]:
import shutil

combined_path = '/content/combined_datasets' #Destination folder; must match the path used in the later cells
for dir_name in os.listdir(CCSN_path):
    combined_subdir = os.path.join(combined_path, dir_name)
    if not os.path.exists(combined_subdir):
        os.makedirs(combined_subdir)

# Function to copy files from source to destination
def copy_files(source_path):
    for dir_name in os.listdir(source_path):
        source_dir_path = os.path.join(source_path, dir_name)
        destination_dir_path = os.path.join(combined_path, dir_name)

        if not os.path.exists(destination_dir_path):
            os.makedirs(destination_dir_path)


        for file_name in os.listdir(source_dir_path):
            src_file_path = os.path.join(source_dir_path, file_name)
            dst_file_path = os.path.join(destination_dir_path, file_name)

            file_counter = 1
            while os.path.exists(dst_file_path):
                name, ext = os.path.splitext(file_name)
                new_name = f"{name}_{file_counter}{ext}"
                dst_file_path = os.path.join(destination_dir_path, new_name)
                file_counter += 1

            # Copy the file to the destination directory
            shutil.copy(src_file_path, dst_file_path)

copy_files(CCSN_path)
copy_files(Howard_path)


#### Recheck how many files there are for each cloud type

In [87]:
combined_path = '/content/combined_datasets'
combined_dir = os.listdir(combined_path)

for dir_name in combined_dir:
    # Create the full path to the directory
    dir_path = os.path.join(combined_path, dir_name)

    # Get a list of files in the directory
    files = os.listdir(dir_path)

    # Count the number of files
    num_files = len(files)

    # Print the directory name and the number of files it contains
    print(f"{dir_name}: {num_files} files")

Ci: 194 files
Ns: 293 files
Sc: 340 files
Cu: 297 files
Cc: 236 files
Ac: 254 files
As: 252 files
St: 394 files
Cb: 317 files
Cs: 187 files
Ct: 155 files
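
As a quick sanity check on the merge, the combined counts should equal the CCSN count plus the Howard-Cloud-X count for every class. The sketch below copies the numbers from the cell outputs above; `find_mismatches` is a hypothetical helper name.

```python
# Per-class counts copied from the cell outputs above
ccsn_counts = {"Sc": 208, "Ns": 162, "Ci": 74, "Cu": 109, "Cs": 58, "Ct": 155,
               "St": 257, "As": 65, "Cc": 105, "Ac": 118, "Cb": 193}
howard_counts = {"Ac": 136, "As": 187, "Cc": 131, "Cs": 129, "Ci": 120,
                 "Cb": 124, "Cu": 188, "Ns": 131, "Sc": 132, "St": 137}
combined_counts = {"Ci": 194, "Ns": 293, "Sc": 340, "Cu": 297, "Cc": 236,
                   "Ac": 254, "As": 252, "St": 394, "Cb": 317, "Cs": 187,
                   "Ct": 155}

def find_mismatches(sources, combined):
    """Return {class: (expected, actual)} for classes that do not add up."""
    expected = {}
    for counts in sources:
        for name, n in counts.items():
            expected[name] = expected.get(name, 0) + n
    return {
        name: (expected.get(name, 0), combined.get(name, 0))
        for name in set(expected) | set(combined)
        if expected.get(name, 0) != combined.get(name, 0)
    }

mismatches = find_mismatches([ccsn_counts, howard_counts], combined_counts)
print("All classes add up" if not mismatches else mismatches)
# prints "All classes add up"
```

`Ct` (contrail) only exists in CCSN, which is why its count stays at 155 after the merge.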


#### **Ohh noo**
Looks like our dataset is imbalanced, and some cloud types even have fewer than 200 pictures. Let's add more data.
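
To put a number on the imbalance, here is a small sketch that uses the combined counts printed above. The 200-picture threshold is just the figure mentioned in the text, not a rule.

```python
# Combined per-class counts copied from the output above
combined_counts = {"Ci": 194, "Ns": 293, "Sc": 340, "Cu": 297, "Cc": 236,
                   "Ac": 254, "As": 252, "St": 394, "Cb": 317, "Cs": 187,
                   "Ct": 155}

largest = max(combined_counts, key=combined_counts.get)
smallest = min(combined_counts, key=combined_counts.get)
ratio = combined_counts[largest] / combined_counts[smallest]

# Classes below the 200-picture mark mentioned in the text
under_200 = sorted(name for name, n in combined_counts.items() if n < 200)

print(f"Largest class:  {largest} ({combined_counts[largest]} files)")
print(f"Smallest class: {smallest} ({combined_counts[smallest]} files)")
print(f"Imbalance ratio: {ratio:.2f}")
print(f"Classes under 200 files: {under_200}")
```

The largest class (`St`) has roughly 2.5 times as many pictures as the smallest (`Ct`), and `Ci`, `Cs` and `Ct` are the three classes below 200.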

This is how you can add more data:
1. Download images one by one from Google or Flickr.

**or**


2. Bulk download using a Python script.
I opt for this option because I'm a ~lazy~ clever person (sorry Duncan)

I found out [how to bulk download images from Flickr](https://www.youtube.com/watch?v=9sBQqlTtQ2k) on YouTube. Credit goes to [Jeff Heaton](https://www.heatonresearch.com/) for his amazing [pyimgdata](https://github.com/jeffheaton/pyimgdata) script for downloading images from Flickr.
What is really surprising is that this Python code still works even though it is already 4 years old!

Now let's run the code

#### **Run pyimgdata scripts**

1. First, you need to [register](https://www.flickr.com/services/apps/create/apply) for the Flickr API to obtain your `key` and `secret`.
2. Download [pyimgdata](https://github.com/jeffheaton/pyimgdata) from GitHub.
3. Open the `config_flickr.ini` file and insert your `key` and `secret`. Also edit the other variables to suit your project.
4. Install the Flickr API package and import Pillow.


In [28]:
pip install flickrapi



In [38]:
from PIL import Image

5. Run `flickr-download.py` on your machine. In my case, I ran it 11 times, since I wanted to add more images for each of the 11 cloud types.

In [59]:
import os
os.chdir('/content')
!python flickr-download.py

2024-04-24 10:07:51,811 - root - INFO : Line 166 - Starting...
  Image.ANTIALIAS)
2024-04-24 10:08:14,386 - root - INFO : Line 158 - Writing sources file.
2024-04-24 10:08:14,387 - root - INFO : Line 192 - Complete, elapsed time: 0:00:22.58
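
Editing `config_flickr.ini` by hand before each of the 11 runs gets tedious, so the edit could be scripted with `configparser`. This is only a sketch under several assumptions: the section name `Download` and the keys `search` and `path` are guesses about how the file is laid out, so check them against your own copy of `config_flickr.ini`. The search terms and the `write_config` helper are also made up for illustration.

```python
import configparser

# Folder abbreviations used by the datasets, mapped to ASSUMED search terms
CLOUD_TYPES = {
    "Ci": "cirrus cloud", "Cs": "cirrostratus cloud", "Cc": "cirrocumulus cloud",
    "Ac": "altocumulus cloud", "As": "altostratus cloud", "Cu": "cumulus cloud",
    "Cb": "cumulonimbus cloud", "Ns": "nimbostratus cloud",
    "Sc": "stratocumulus cloud", "St": "stratus cloud", "Ct": "contrail",
}

def write_config(template_path, out_path, search_term, download_dir,
                 section="Download", search_key="search", path_key="path"):
    """Rewrite an .ini file with a new search term and download folder.

    Hypothetical helper. The default section and key names are ASSUMPTIONS
    about config_flickr.ini; pass your own if the file differs.
    """
    config = configparser.ConfigParser()
    if not config.read(template_path):
        raise FileNotFoundError(template_path)
    if not config.has_section(section):
        raise KeyError(f"Section [{section}] not found in {template_path}")
    config[section][search_key] = search_term
    config[section][path_key] = download_dir
    with open(out_path, "w") as config_file:
        config.write(config_file)
```

In Colab, the loop would then look something like `for abbr, term in CLOUD_TYPES.items():`, calling `write_config('config_flickr.ini', 'config_flickr.ini', term, f'/content/flickr_datasets/{abbr}')` and then `!python flickr-download.py` inside it. Note that `configparser` drops any comments in the file when it writes it back, so keep a copy of the original.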


In [82]:
flickr_path = '/content/flickr_datasets'
flickr_dir = os.listdir(flickr_path)

for dir_name in flickr_dir:
    # Create the full path to the directory
    dir_path = os.path.join(flickr_path, dir_name)

    # Get a list of files in the directory
    files = os.listdir(dir_path)

    # Count the number of files
    num_files = len(files)

    # Print the directory name and the number of files it contains
    print(f"{dir_name}: {num_files} files")

Ci: 352 files
Ns: 352 files
Sc: 352 files
Cu: 352 files
Cc: 365 files
Ac: 379 files
As: 361 files
St: 74 files
.ipynb_checkpoints: 0 files
Cb: 352 files
Cs: 332 files
Ct: 352 files


6. Back up the Flickr datasets and the combined datasets (CCSN & Howard) to Google Drive

In [88]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

# Path to the directory you want to copy
flickr_directory = '/content/flickr_datasets'

# Path in your Google Drive where you want to store the data
flickr_destination_directory = '/content/drive/MyDrive/Kuliah/UCL/Projects/CASA0018-Cloud-Classification/Projects/Final_Project/Datasets/flick_datasets'

# Copy the entire directory to Google Drive
!cp -r {flickr_directory} {flickr_destination_directory}

print("Files have been copied to your Google Drive.")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Files have been copied to your Google Drive.


In [89]:
# Path to the directory you want to copy
combined_directory = '/content/combined_datasets'

# Path in your Google Drive where you want to store the data
combined_destination_directory = '/content/drive/MyDrive/Kuliah/UCL/Projects/CASA0018-Cloud-Classification/Projects/Final_Project/Datasets/combined_datasets'

# Copy the entire directory to Google Drive
!cp -r {combined_directory} {combined_destination_directory}

print("Files have been copied to your Google Drive.")

Files have been copied to your Google Drive.
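
If you prefer to stay in Python instead of shelling out to `cp`, the backup could be done with `shutil.copytree`. This is a sketch rather than what the notebook actually ran; `backup_directory` is a hypothetical helper, and `dirs_exist_ok=True` needs Python 3.8 or newer.

```python
import os
import shutil

def backup_directory(source_directory, destination_directory):
    """Copy a whole folder tree and return the number of files at the destination.

    Hypothetical helper. Existing files with the same name are overwritten,
    and re-running it merges into the destination instead of failing.
    """
    if not os.path.isdir(source_directory):
        raise FileNotFoundError(f"Source folder not found: {source_directory}")
    shutil.copytree(source_directory, destination_directory, dirs_exist_ok=True)
    # Count what is now in the backup so the result can be checked
    return sum(len(files) for _, _, files in os.walk(destination_directory))
```

One practical difference: when the destination folder already exists, `cp -r` places the source folder inside it, while `copytree` with `dirs_exist_ok=True` merges the contents into it. The returned file count also gives a quick check that the copy really happened, for example `backup_directory(combined_directory, combined_destination_directory)`.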


## **Data Processing**
#### Import dependencies and required modules

In [None]:
import os #for creating path names and manipulating directories/files in an operating system
import random
import tensorflow as tf
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from shutil import copyfile
from PIL import Image
import PIL
import numpy as np
import cv2
import matplotlib.pyplot as plt
import datetime #for tensorboard and logging

print(tf.__version__)

2.15.0


### Connect to Data Directory in Google Drive

##### CCSN Datasets

##### Howard Datasets