# HW5 Kaggle competition

Participate in this Kaggle competition: Histopathologic Cancer Detection
https://www.kaggle.com/c/histopathologic-cancer-detection/overview

After submitting the result, please submit a report and supporting material.
You can submit and improve your results as many times as you want before the homework is due. The iterative process takes time, so to get better quality results and a better report, please start early. Grades are based more on the quality and depth of the analysis than on a better Kaggle score.

Your report should include:
1. A brief description of the problem and data (e.g. size and dimension, structure, etc.)
2. Exploratory data analysis showing a few visualizations, histograms, etc., and a plan of analysis. Any data cleaning procedure.
3. Your model architecture and reasoning why you believe certain architecture would be suitable for this problem
4. Results (tables, figures, etc.) and analysis (reasoning of why or why not something worked well, also troubleshooting and hyperparameter optimization procedure summary)
5. Your conclusion.

[Deliverables- submission guide]
1. Submit your report in the Moodle HW5 submission box. Your report should be in .pdf format (you can print out your notebook if you prefer).
   Besides the report contents above, your report should also include:
   - The URL to your git repository (GitHub or similar; your repo needs to be public so that I can see it).
   - A screenshot of your leaderboard position/score.
   - Importantly, your name at the top :)
2. Upload your material such as notebooks/code/scripts to your git repository (please do not upload big files such as data).

[Using online references]
A Kaggle competition is not only a tool for data science challenges and practice, but also a great way to learn and share skills and tips. In the kernels section, there are several notebooks that people have shared. Feel free to refer to them, but please cite them properly. Also, please keep in mind that you need to do more than what's available in those in order to earn points.

**The Git Repository for this project can be found at [https://github.com/seel6470/CSPB-3202-HW5](https://github.com/seel6470/CSPB-3202-HW5)**

## Final Submission Screenshot:

![image](./images/Screenshot%202024-08-01%20151900.png)

## Description

*brief description of the problem, data (e.g. size and dimension, structure etc)*

The Histopathologic Cancer Detection Kaggle competition (located at https://www.kaggle.com/c/histopathologic-cancer-detection/overview ) asks participants to create a machine learning algorithm that identifies metastatic cancer in small image patches taken from larger digital pathology scans. The problem itself is a __binary classification problem__: each image is labeled as either negative or positive.

Furthermore, the data description specifies that an image is labeled positive only if its center 32 x 32 pixel region contains at least one pixel of tumor tissue. Tumor tissue in the outer region of the image does not influence the label.
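
As a minimal sketch of where that center region sits in pixel coordinates, the helper below computes the bounding box of a centered square crop. The function name `center_crop_box` is hypothetical (it is not part of the project code), and the 96 x 96 image size is taken from the data description rather than checked here.

```python
def center_crop_box(width, height, crop_size=32):
    """Return (left, upper, right, lower) for a centered square crop."""
    if crop_size > width or crop_size > height:
        raise ValueError("crop_size must fit inside the image")
    left = (width - crop_size) // 2
    upper = (height - crop_size) // 2
    return (left, upper, left + crop_size, upper + crop_size)


# For a 96 x 96 image, the labeled region spans pixels 32..63 on both axes
box = center_crop_box(96, 96)
print(box)  # (32, 32, 64, 64)
```

The tuple uses the same (left, upper, right, lower) order that Pillow's `Image.crop` expects, so `img.crop(box)` would return the 32 x 32 center patch of an opened image.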

The data is organized into two directories, `train` and `test`.

The `train` directory contains 220,025 tif images while the `test` directory contains 57,458 tif images.

<pre>

data/
├── train/
│   ├── 0000d563d5cfafc4e68acb7c9829258a298d9b6a.tif
│   ├── 0000da768d06b879e5754c43e2298ce48726f722.tif
│   ├── 0000f8a4da4c286eee5cf1b0d2ab82f979989f7b.tif
│   └── ...
└── test/
    ├── 0000ec92553fda4ce39889f9226ace43cae3364e.tif
    ├── 000c8db3e09f1c0f3652117cf84d78aae100e5a7.tif
    ├── 000de14191f3bab4d2d6a7384ca0e5aa5dc0dffe.tif
    └── ...
</pre>

Each file represents a color 96 x 96 image with each pixel represented as a 24 bit RGB value.
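
To get a rough sense of scale, the arithmetic below estimates the uncompressed size of the training images from the figures above. It assumes 3 bytes per pixel with no alpha channel, and it ignores TIFF headers and any compression, so the actual size on disk will differ.

```python
WIDTH, HEIGHT = 96, 96
BYTES_PER_PIXEL = 3   # 24-bit RGB = 8 bits per channel
N_TRAIN = 220_025     # number of training images

bytes_per_image = WIDTH * HEIGHT * BYTES_PER_PIXEL
total_train_bytes = bytes_per_image * N_TRAIN

print(bytes_per_image)                     # 27648
print(round(total_train_bytes / 1e9, 2))   # 6.08 (decimal GB)
```

At roughly 6 GB of raw pixel data, loading the whole training set into memory at once may be impractical on a typical machine, which is one reason to read images in batches.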

The labels for all training images are contained in a CSV file named `train_labels.csv` with two columns, "id" and "label".

The "id" column holds the filename (without the file extension). It is a categorical nominal data type, since its values have no order or ranking. The "label" column holds the binary classification, 0 or 1, which is also categorical nominal data.



## Exploratory Data Analysis

*Exploratory data analysis showing a few visualization, histogram, etc, and a plan of analysis. Any data cleaning procedure.*

It would be helpful to determine whether all images have a consistent resolution or whether the image sizes differ. Checking every image is slow because of the size of the data set, but measuring the dimensions of a random subset of the images should give a good indication.

> __Note:__ Due to the size of the data set, I chose to work in a local environment, downloading all images to a local directory and running my scripts from the command line. The following code may not be executable in this notebook, but the outputs received in my local environment are shown after each code block.


```python
import os

import matplotlib.pyplot as plt
import pandas as pd
from PIL import Image

data = pd.read_csv('train_labels.csv', dtype=str)

data['id'] = data['id'] + '.tif'  # Add file extension

# Select a random subset of 256 images
subset = data.sample(n=256, random_state=1975)

train_directory = './train'

# Create lists to store image widths and heights
widths = []
heights = []

# Iterate over the subset to get image dimensions
for image_file in subset['id']:
    image_path = os.path.join(train_directory, image_file)
    if os.path.exists(image_path):
        with Image.open(image_path) as img:
            width, height = img.size
            widths.append(width)
            heights.append(height)
    else:
        print(f"Image file {image_file} does not exist.")

# Plot Histogram for Widths
plt.figure(figsize=(12, 6))
plt.hist(widths, bins=20, color='skyblue', edgecolor='black')
plt.xlabel('Width (pixels)')
plt.ylabel('Number of Images')
plt.title('Histogram of Image Widths for 256 Random Samples')
plt.tight_layout()
plt.savefig('image_widths_histogram.png')

# Plot Histogram for Heights
plt.figure(figsize=(12, 6))
plt.hist(heights, bins=20, color='salmon', edgecolor='black')
plt.xlabel('Height (pixels)')
plt.ylabel('Number of Images')
plt.title('Histogram of Image Heights for 256 Random Samples')
plt.tight_layout()
plt.savefig('image_heights_histogram.png')
```

![image](./images/image_heights_histogram.png)

![image](./images/image_widths_histogram.png)

As we can see, every image in the random sample is exactly 96 x 96 pixels. This is helpful, since we will need a way to crop the center 32 x 32 pixels given the data description on Kaggle, and the uniform image size ensures that the center is in the same location for all images. We can further confirm this by outputting several random images:

```python
# get 4 random images
subset = data.sample(n=4, random_state=1975)

train_directory = './train'

# Create a figure to plot the images
plt.figure(figsize=(10, 10))

# Initialize index
index = 1

# Plot and save each image
for image_file in subset['id']:
    image_path = os.path.join(train_directory, image_file)
    if os.path.exists(image_path):
        with Image.open(image_path) as img:
            plt.subplot(2, 2, index)  # Use index to determine subplot position
            plt.imshow(img)
            plt.title(f'Image {index}')
            plt.axis('off')  # Turn off axis
            index += 1
    else:
        print(f"Image file {image_file} does not exist.")

# Save the plot
plt.tight_layout()
plt.savefig('sample_images.png')
```

![image](./images/sample_images.png)

Additionally, it would be helpful to know the distribution of labels in our training data. Ideally, the two classes would be roughly equal in size to avoid bias in our model. If they are not, we should understand the imbalance before training.

```python
import matplotlib.pyplot as plt
import pandas as pd

data = pd.read_csv('train_labels.csv', dtype=str)

label_counts = data['label'].value_counts()

plt.figure(figsize=(8, 8))
plt.pie(label_counts, labels=label_counts.index, autopct='%1.1f%%', colors=['skyblue', 'salmon'])
plt.title('Distribution of Binary Labels (0 and 1)')
plt.tight_layout()

# Save the pie chart
plt.savefig('binary_labels_pie_chart.png')
```

![image](./images/binary_labels_pie_chart.png)

We can see that our training set contains more negative than positive images. Given the large size of the data set, this moderate imbalance should not cause too much of an issue; a much stronger imbalance (say 90/10) would be cause for concern. Still, we should watch for the model learning that predicting 0 is more often correct than predicting 1. For this reason, we may want to use metrics other than accuracy, such as the area under the ROC curve.
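
The difference between accuracy and the area under the ROC curve can be sketched with a small pure-Python example. The labels and scores below are made up for illustration (they are not results from this project), and `accuracy` and `roc_auc` are hypothetical helper names. The AUC here is computed as the fraction of positive/negative pairs in which the positive sample receives the higher score, with ties counted as one half.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    # Count a win as 1, a tie as 0.5, and a loss as 0 over all pairs
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels with a 60/40 split (made up for illustration)
labels = [0, 0, 0, 1, 1]
always_negative = [0.0] * len(labels)      # a model that always predicts 0
informative = [0.1, 0.6, 0.3, 0.8, 0.5]    # a model with some ranking ability

print(accuracy(labels, always_negative), roc_auc(labels, always_negative))  # 0.6 0.5
print(accuracy(labels, informative), roc_auc(labels, informative))
```

The always-negative model reaches 60% accuracy on this toy split simply by matching the majority class, while its AUC stays at 0.5, the value expected from a model with no ranking ability. In practice, a library routine such as scikit-learn's `roc_auc_score` would typically be used instead of a hand-written pairwise loop.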

## Model Architecture

*Your model architecture and reasoning why you believe certain architecture would be suitable for this problem*

## Results

*results (tables, figures etc) and analysis (reasoning of why or why not something worked well, also troubleshooting and hyperparameter optimization procedure summary)*

## Conclusion