# Lung and Colon Cancer Classification
## About Dataset
This dataset contains 25,000 histopathological images in 5 classes. All images are 768 x 768 pixels and stored in JPEG format.
The images were generated from an original sample of HIPAA-compliant and validated sources: 750 images of lung tissue (250 benign lung tissue, 250 lung adenocarcinomas, and 250 lung squamous cell carcinomas) and 500 images of colon tissue (250 benign colon tissue and 250 colon adenocarcinomas). These 1,250 originals were augmented to 25,000 images using the Augmentor package.
The dataset has five classes of 5,000 images each:

* Lung benign tissue
* Lung adenocarcinoma
* Lung squamous cell carcinoma
* Colon adenocarcinoma
* Colon benign tissue
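
The composition described above can be checked with a little arithmetic. The sketch below uses only the counts stated in this section; the variable names and the hyphenated class labels are illustrative choices, not part of the dataset.

```python
# Original (pre-augmentation) image counts, as stated in the dataset description.
original_counts = {
    "lung-benign-tissue": 250,
    "lung-adenocarcinoma": 250,
    "lung-squamous-cell-carcinoma": 250,
    "colon-benign-tissue": 250,
    "colon-adenocarcinoma": 250,
}
AUGMENTED_TOTAL = 25_000  # total number of images after augmentation

original_total = sum(original_counts.values())
images_per_class = AUGMENTED_TOTAL // len(original_counts)
augmentation_factor = AUGMENTED_TOTAL // original_total

print(f"Original images: {original_total}")
print(f"Images per class after augmentation: {images_per_class}")
print(f"Augmentation factor: {augmentation_factor}x")
```

Each original image therefore has roughly 20 augmented variants in the dataset, which is worth keeping in mind when splitting into train, validation, and test sets: variants of the same original can end up on both sides of a random split.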


How to Cite this Dataset
If you use this dataset in your research, please credit the authors:

Original Article
Borkowski AA, Bui MM, Thomas LB, Wilson CP, DeLand LA, Mastorides SM. Lung and Colon Cancer Histopathological Image Dataset (LC25000). arXiv:1912.12142v1 [eess.IV], 2019

Relevant Links
https://arxiv.org/abs/1912.12142v1
https://github.com/tampapath/lung_colon_image_set

Dataset BibTeX
@article{,
title= {LC25000 Lung and colon histopathological image dataset},
keywords= {cancer,histopathology},
author= {Andrew A. Borkowski, Marilyn M. Bui, L. Brannon Thomas, Catherine P. Wilson, Lauren A. DeLand, Stephen M. Mastorides},
url= {https://github.com/tampapath/lung_colon_image_set}
}


## Imports

In [4]:
import os

import pyrootutils

root = pyrootutils.setup_root(
    search_from=os.path.dirname(os.getcwd()),
    indicator=[".git", "pyproject.toml"],
    pythonpath=True,
    dotenv=True,
)

if os.getenv("DATA_ROOT") is None:
    os.environ["DATA_ROOT"] = f"{root}"

In [None]:
from pathlib import Path

import matplotlib.pyplot as plt
from hydra import compose, initialize
from torchvision.datasets import ImageFolder

from src.utils.download_kaggel_ds import download_kaggle_dataset, flatten_dataset_dir

## Download datasets

In [23]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [24]:
# https://gist.github.com/bdsaglam/586704a98336a0cf0a65a6e7c247d248

with initialize(version_base="1.2", config_path="../configs"):
    cfg = compose(config_name="train")
    print(cfg.paths)

{'root_dir': '${oc.env:PROJECT_ROOT}', 'results_dir': '${paths.root_dir}/results', 'best_model_json_name': 'best_model.json', 'best_model_path': '${paths.results_dir}', 'cloud_model_key': 'cloud_model.ckpt', 'cloud_model_save_path': '${paths.results_dir}/cloud_model.ckpt', 'log_dir': '${paths.root_dir}/logs/', 'output_dir': '${hydra:runtime.output_dir}', 'work_dir': '${hydra:runtime.cwd}', 'train_raw_dir': '${data.dataset_dir}/raw/train', 'valid_raw_dir': '${data.dataset_dir}/raw/valid', 'test_raw_dir': '${data.dataset_dir}/raw/test', 'train_processed_dir': '${data.dataset_dir}/processed/train', 'valid_processed_dir': '${data.dataset_dir}/processed/valid', 'test_processed_dir': '${data.dataset_dir}/processed/test'}


In [None]:
DATASET_DIR = Path(root) / cfg.data.dataset_dir

# download_kaggle_dataset(cfg.data.dataset_download_name, DATASET_DIR)

In [27]:
flatten_dataset_dir(DATASET_DIR)

## Class names and paths

In [None]:
CLASS_NAMES = [
    "colon-adenocarcinoma",
    "colon-benign-tissue",
    "lung-adenocarcinoma",
    "lung-benign-tissue",
    "lung-squamous-cell-carcinoma",
]

# Map integer class index -> class name.
class_mapping = dict(enumerate(CLASS_NAMES))
class_mapping

In [None]:
cfg.paths.train_raw_dir

In [None]:
DATASET_DIR = Path(root) / cfg.data.dataset_dir / cfg.data.dataset_name

# Key names follow the `cfg.paths` output printed above
# (`train_raw_dir`, `valid_raw_dir`, `test_raw_dir`).
TRAIN_IMAGE_DIR = Path(root) / cfg.paths.train_raw_dir
VALID_IMAGE_DIR = Path(root) / cfg.paths.valid_raw_dir
TEST_DIR = Path(root) / cfg.paths.test_raw_dir
DATASET_DIR

## Loading Images

In [None]:
datasets = ImageFolder(
    root=str(DATASET_DIR),
    transform=None,
    target_transform=None,
    is_valid_file=None,
)
print(f"Number of images in the dataset: {len(datasets)}")

In [None]:
datasets.class_to_idx
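
`ImageFolder` builds `class_to_idx` by sorting the class sub-directory names and enumerating them. The sketch below mimics that ordering in plain Python so the expected mapping can be checked without loading any images. It assumes the directory names on disk match the entries of `CLASS_NAMES`, which this notebook does not confirm; `expected_class_to_idx` is a hypothetical helper, not part of torchvision.

```python
def expected_class_to_idx(class_dir_names):
    """Mimic ImageFolder's index assignment: sort the names, then enumerate."""
    return {name: idx for idx, name in enumerate(sorted(class_dir_names))}


# Assumed directory names (copied from CLASS_NAMES), deliberately out of order.
names = [
    "lung-squamous-cell-carcinoma",
    "colon-benign-tissue",
    "lung-benign-tissue",
    "colon-adenocarcinoma",
    "lung-adenocarcinoma",
]

mapping = expected_class_to_idx(names)
for name, idx in mapping.items():
    print(idx, name)
```

Because `CLASS_NAMES` is already in alphabetical order, its list positions should agree with the indices `ImageFolder` assigns, provided the assumption about directory names holds.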

In [None]:
# Inspect the first sample; with transform=None, ImageFolder returns a PIL image.
image, label = datasets[0]
print(f"Image size (width, height): {image.size}, label: {label}")
plt.imshow(image)
plt.title(datasets.classes[label])
plt.axis("off")
plt.show()

In [None]:
dir(datasets)

In [None]:
import random

IMAGE_NUM = 9
random_range = random.sample(range(0, len(datasets)), IMAGE_NUM)
random_range

In [None]:
plt.figure(figsize=(10, 10))

for i, idx in enumerate(random_range):
    plt.subplot(3, 3, i + 1)
    img, label = datasets[idx]  # load each sample once
    plt.imshow(img)  # RGB image, so no colormap is needed
    plt.axis("off")
    plt.title(f"{datasets.classes[label]}-{img.size}")
plt.tight_layout()
plt.show()

## Image count

In [None]:
for class_name in CLASS_NAMES:
    class_dir = DATASET_DIR / class_name
    print(f"Class: {class_name}, Number of images: {len(list(class_dir.iterdir()))}")
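
The count above includes every entry in a class directory, so stray files or nested folders would inflate it. A minimal sketch of a stricter count that keeps only image files is shown below. `count_images` and the `IMAGE_EXTENSIONS` set are assumptions introduced for illustration; they are not part of this project's code.

```python
from pathlib import Path

# Assumed set of image extensions; the dataset description mentions JPEG only.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def count_images(class_dir: Path) -> int:
    """Count regular files in class_dir whose extension looks like an image."""
    return sum(
        1
        for path in class_dir.iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )
```

It could be used as `count_images(DATASET_DIR / class_name)` inside the loop above. If the download is complete, each class should report 5,000 images, in line with the dataset description.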