# Skin Cancer Classification - Convolutional Network

### by ReDay Zarra

This project uses a convolutional neural network to **identify 9 different kinds of skin cancers**, including melanoma, nevus, and more. The model is **trained on over 2,200 pictures of various skin cancers** from this [dataset](https://www.kaggle.com/datasets/nodoubttome/skin-cancer9-classesisic). It implements fundamental computer vision and classification techniques and includes a *step-by-step implementation of the model* as well as *in-depth notes to customize the model further* for higher accuracy.

## Importing the necessary libraries

We start by importing the essential **libraries for data manipulation and numerical analysis**, along with libraries for **data visualization and plotting**. Pickle will be used to **serialize our folder of images** into train.p, valid.p, and test.p. Note that Pickle stores the data as-is; it does not compress it.

In [94]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn

import os
import pickle

## Serializing image files with Pickle

The [dataset](https://www.kaggle.com/datasets/nodoubttome/skin-cancer9-classesisic) provides us with **over 2,350 images in two different folders** called "Train" and "Test". Each folder **contains sub-folders (whose names are the labels)**, and each sub-folder holds the images for that class. We need to **serialize these images into train.p and test.p**, with every image paired with its label. The third file, valid.p, is created later from part of the training data.

In [95]:
# Specify the directory containing the Train and Test folders
main_folder = 'skin-cancers'
train_folder = 'Train'

# Initialize an empty list to store the image data and labels
train_data = []

# Iterate through the sub-folders in the Train folder
for sub_folder in os.listdir(os.path.join(main_folder, train_folder)):
    sub_folder_path = os.path.join(main_folder, train_folder, sub_folder)
    for image_file in os.listdir(sub_folder_path):
        # Add the image data and label to the list
        with open(os.path.join(sub_folder_path, image_file), 'rb') as f:
            image_data = f.read()
        # Assign the label as the sub-folder's name
        label = sub_folder
        train_data.append((image_data, label))

# Save the train data list as a .p file using pickle
with open(os.path.join(main_folder, 'train.p'), 'wb') as f:
    pickle.dump(train_data, f)

# Repeat steps for testing data
test_folder = 'Test'
test_data = []
for sub_folder in os.listdir(os.path.join(main_folder, test_folder)):
    sub_folder_path = os.path.join(main_folder, test_folder, sub_folder)
    for image_file in os.listdir(sub_folder_path):
        with open(os.path.join(sub_folder_path, image_file), 'rb') as f:
            image_data = f.read()
        label = sub_folder
        test_data.append((image_data, label))

with open(os.path.join(main_folder, 'test.p'), 'wb') as f:
    pickle.dump(test_data, f)
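
The two loops above differ only in the folder they read and the file they write, so they could be folded into one helper. The sketch below is one possible refactoring, not the notebook's original code: `pickle_image_folder` is a hypothetical name, and the demo runs on a temporary folder with fake image bytes rather than the real dataset. It also skips stray non-directory entries and sorts the listings so the order is reproducible.

```python
import os
import pickle
import tempfile


def pickle_image_folder(folder, output_path):
    """Collect (image_bytes, label) pairs from folder and pickle them."""
    samples = []
    # sorted() makes the ordering reproducible across operating systems
    for label in sorted(os.listdir(folder)):
        label_dir = os.path.join(folder, label)
        if not os.path.isdir(label_dir):
            continue  # skip stray files that are not class sub-folders
        for name in sorted(os.listdir(label_dir)):
            with open(os.path.join(label_dir, name), 'rb') as f:
                samples.append((f.read(), label))
    with open(output_path, 'wb') as f:
        pickle.dump(samples, f)
    return samples


# Demo on a throwaway folder with two fake classes (not the real dataset)
with tempfile.TemporaryDirectory() as tmp:
    for class_name in ('melanoma', 'nevus'):
        class_dir = os.path.join(tmp, 'Train', class_name)
        os.makedirs(class_dir)
        with open(os.path.join(class_dir, 'img0.jpg'), 'wb') as f:
            f.write(b'\xff\xd8 fake ' + class_name.encode())

    demo = pickle_image_folder(os.path.join(tmp, 'Train'),
                               os.path.join(tmp, 'train.p'))
    with open(os.path.join(tmp, 'train.p'), 'rb') as f:
        reloaded = pickle.load(f)

demo_labels = [label for _, label in demo]
print(demo_labels)
```

With the real data, the same helper would be called once for the "Train" folder and once for the "Test" folder.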

## Loading the dataset

The code above writes the pickle files, but they still need to be loaded back into the notebook. This is done by opening each file with open() and passing it to Pickle's **load() function, which reads the stored dataset** into a variable, in this case train and test.

In [100]:
with open('train.p', 'rb') as f:
    train = pickle.load(f)

# train is a list of (image_bytes, label) tuples, so train[0] is a single
# sample rather than the full set of images; unpack it to inspect it
image_bytes, label = train[0]
print(len(train), label, len(image_bytes))

In [71]:
# Load the train data from the .p file
with open('train.p', 'rb') as f:
    train = pickle.load(f)

# Load the test data from the .p file
with open('test.p', 'rb') as f:
    test = pickle.load(f)
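
To make the structure of the loaded objects concrete, here is a minimal sketch using a small stand-in list instead of the real train.p. The fake byte strings and the three-sample list are assumptions for illustration only. Each entry is a tuple of raw file bytes and a label, and the bytes are still an encoded image file: JPEG files, for instance, begin with the marker bytes 0xff 0xd8.

```python
import pickle
from collections import Counter

# Stand-in for the list stored in train.p: (raw file bytes, label) tuples
train = [
    (b'\xff\xd8\xff\xe0 fake jpeg 1', 'melanoma'),
    (b'\xff\xd8\xff\xe0 fake jpeg 2', 'nevus'),
    (b'\xff\xd8\xff\xe0 fake jpeg 3', 'nevus'),
]

# Round-trip through pickle in memory, mirroring pickle.dump / pickle.load
train = pickle.loads(pickle.dumps(train))

first_bytes, first_label = train[0]
is_jpeg = first_bytes[:2] == b'\xff\xd8'  # JPEG start-of-image marker
class_counts = Counter(label for _, label in train)

print(type(train).__name__, len(train))
print(first_label, is_jpeg)
print(class_counts)
```

Because the first element of every tuple is encoded bytes rather than pixels, it has to be decoded before it can be treated as an image array.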


In [91]:
# train is a list of (image_bytes, label) tuples; collect each column separately
X_train = [image_data for image_data, _ in train]
y_train = [label for _, label in train]

In [93]:
X_train = np.array(X_train)

## Splitting the dataset

The dataset can now be **split into training and testing sets**. We need to make sure that **X_train and X_test only contain image data**, because those are the features. The **labels**, or the dependent variable, will be **stored in y_train and y_test**.

### Creating training and testing sets

The training and testing sets can be **created with a simple for loop** that will **iterate through every element** in the train and test variables (which store all images and their labels) and **add them to separate empty lists** for features and labels.

In [72]:
# X_train will contain the image data and y_train will contain the labels
X_train, y_train = [], []
for image_data, label in train:
    X_train.append(image_data)
    y_train.append(label)

# X_test will contain the image data and y_test will contain the labels
X_test, y_test = [], []
for image_data, label in test:
    X_test.append(image_data)
    y_test.append(label)
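
As an alternative to the explicit loops, the same separation can be written with `zip`. This is only a sketch on a three-element stand-in list (the byte strings are placeholders, not real images), and it assumes the list is non-empty, since unpacking an empty `zip` would fail.

```python
# Stand-in for the loaded list of (image_bytes, label) tuples
train = [(b'img-a', 'melanoma'), (b'img-b', 'nevus'), (b'img-c', 'nevus')]

# zip(*train) transposes the list of pairs into one tuple per column;
# list() turns each column back into a list, like the loop version
X_train, y_train = (list(column) for column in zip(*train))

print(len(X_train), len(y_train))
print(y_train)
```

Both approaches keep features and labels in the same order, so index i in X_train still corresponds to index i in y_train.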

### Creating validation set

The validation set can be **created using scikit-learn's train_test_split()** function because we can simply **divide the training set** that we already have. I have chosen to **assign 20% of my training set** to my validation set. The validation data is then **stored separately in a valid.p** file.

In [73]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the training data as a validation set (an 80/20 split)
X_train, X_validation, y_train, y_validation = train_test_split(
    X_train, y_train, test_size=0.2, random_state=0)

# Save the validation data as valid.p using pickle
with open('valid.p', 'wb') as f:
    pickle.dump((X_validation, y_validation), f)
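
If the classes turn out to be imbalanced, a plain random split can leave some labels under-represented in the validation set. One possible customization, not used in this notebook, is the `stratify` argument of `train_test_split()`, which preserves the class proportions in both parts. The sketch below uses made-up data (50 placeholder samples across two labels) purely for illustration.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Made-up, imbalanced stand-in data: 40 samples of one class, 10 of another
X = [f'image-{i}'.encode() for i in range(50)]
y = ['melanoma'] * 40 + ['nevus'] * 10

# stratify=y keeps the 80/20 class ratio in both the training and validation parts
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

val_counts = Counter(y_val)
print(len(X_tr), len(X_val))
print(val_counts)
```

The same argument could be added to the split above as `stratify=y_train`.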

### Convert to arrays

> In order to manipulate the data any further, the training, testing, and validation **datasets need to be converted into arrays**. NumPy's np.array() function offers a simple way to do just that.

In [84]:
import numpy as np

# Convert the list of image data into a numpy array
X_train = np.array(X_train)
y_train = np.array(y_train)

In [88]:
len(X_train)

1791

In [67]:
# Convert the validation image data and labels into NumPy arrays
X_validation = np.array(X_validation)
y_validation = np.array(y_validation)

In [69]:
# Convert the test image data and labels into NumPy arrays
X_test = np.array(X_test)
y_test = np.array(y_test)
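
One caveat when converting a list of raw `bytes` objects: by default NumPy chooses a fixed-width byte-string dtype, which pads every element to the longest length and drops trailing null bytes when an element is read back. Passing `dtype=object` keeps each element as an intact Python `bytes` object. The sketch below shows the difference on two made-up byte strings; it illustrates the behavior and is not a change to the cells above.

```python
import numpy as np

# Two made-up payloads; the first one ends in null bytes on purpose
raw = [b'\xff\xd8 short\x00\x00', b'\xff\xd8 a longer image payload']

fixed = np.array(raw)                      # fixed-width byte-string dtype
as_objects = np.array(raw, dtype=object)   # array of Python bytes objects

print(fixed.dtype, fixed.shape)
print(bytes(fixed[0]) == raw[0])   # trailing null bytes were dropped
print(as_objects[0] == raw[0])     # the original bytes are preserved
```

Either way the result is a one-dimensional array with one entry per image, because the entries are encoded files rather than pixel grids.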

## Checking the dimensions of the dataset

Before processing the data, it is necessary to make sure the datasets and the variables we have stored them in are correct. We can easily see the shape of each array through its .shape attribute.

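In [81]:
# Attempt to reshape into (batch, height, width, channels). This fails because
# each element of X_train is still an encoded image file (raw bytes), not a
# pixel array, so X_train[0].shape is the empty tuple ()
batch_size = X_train.shape[0]
height = X_train[0].shape[0]
X_train = np.reshape(X_train, (batch_size, height, width, channels))

IndexError: tuple index out of range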

In [78]:
X_train.shape

(1791,)

In [55]:
# Get the input size of the images
input_size = X_train.shape[1:]

# Get the input depth of the images
input_depth = input_size[-1]

(1791,)

In [43]:
X_validation.shape

(448,)

In [44]:
y_validation.shape

(448,)

In [45]:
X_test.shape

(118,)

In [46]:
y_test.shape

(118,)
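
Every shape reported in this section is one-dimensional because each element is still an encoded image file, not an array of pixels. A four-dimensional (batch, height, width, channels) array only exists once the bytes are decoded and resized to a common size. The sketch below shows one way this could be done with Pillow, which this notebook does not import; `decode_images`, `make_jpeg`, and the 64x64 target size are assumptions chosen for illustration, and the demo uses two generated images instead of the dataset.

```python
import io

import numpy as np
from PIL import Image


def decode_images(raw_images, size=(64, 64)):
    """Decode encoded image bytes into a (batch, height, width, channels) array."""
    decoded = []
    for raw in raw_images:
        with Image.open(io.BytesIO(raw)) as img:
            # Force 3 channels and a common size so the arrays can be stacked
            resized = img.convert('RGB').resize(size)
            decoded.append(np.asarray(resized, dtype=np.uint8))
    return np.stack(decoded)


def make_jpeg(width, height, color):
    """Build a small solid-color JPEG in memory for the demo."""
    buffer = io.BytesIO()
    Image.new('RGB', (width, height), color).save(buffer, format='JPEG')
    return buffer.getvalue()


# Two generated images of different sizes stand in for the pickled bytes
raw_images = [make_jpeg(120, 90, (200, 30, 30)), make_jpeg(80, 80, (30, 30, 200))]
X_demo = decode_images(raw_images, size=(64, 64))

batch_size, height, width, channels = X_demo.shape
print(X_demo.shape)  # (2, 64, 64, 3)
```

Applied to the real data, the decoded array would expose the height, width, and channel count that the reshape attempt above needs, at whatever target size is chosen.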