# Preparing the dataset
This notebook downloads the dataset from its source and prepares it for exploration.

### Downloading COVID dataset

Download the data from Kaggle using the `kagglehub` library.

In [1]:
# Auto reload modules
%load_ext autoreload
%autoreload 2

In [2]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("tawsifurrahman/covid19-radiography-database")

print("Path to dataset files:", path)


Path to dataset files: /Users/rehabaam/.cache/kagglehub/datasets/tawsifurrahman/covid19-radiography-database/versions/5


In [3]:
import sys
sys.path.append("../")

In [4]:
from src.common.dataset_manager import copy_raw_data, convert_excel_to_csv, create_class_folders, move_files_to_class_folders

# Folders to store the data
raw_data_dir = '../data/raw'
old_data_dir = '../data/raw/COVID-19_Radiography_Dataset'
new_data_dir = '../data/raw/dataset'

Copy all files from the cache folder to the `data/raw` folder.

In [5]:
copy_raw_data(path, raw_data_dir)
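
The source of `copy_raw_data` is not shown in this notebook. As a rough idea of what such a step might involve, here is a minimal sketch that assumes the helper simply copies the cache directory recursively. The name `copy_tree_sketch` and its return value are hypothetical and not part of the project.

```python
import shutil
from pathlib import Path


def copy_tree_sketch(src_dir: str, dst_dir: str) -> int:
    """Recursively copy src_dir into dst_dir and return the number of source files.

    Hypothetical stand-in for the project's copy_raw_data helper.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    # dirs_exist_ok=True (Python 3.8+) allows re-running the cell without errors
    shutil.copytree(src, dst, dirs_exist_ok=True)
    # Count regular files in the source tree as a simple sanity check
    return sum(1 for p in src.rglob("*") if p.is_file())
```

Copying rather than moving leaves the `kagglehub` cache intact, so a later `dataset_download` call should not need to fetch the files again.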

Convert all Excel sheets to CSV files. (*This step is optional; skip it if you prefer working with Excel sheets rather than CSV files.*)

In [6]:
convert_excel_to_csv(old_data_dir)

Converted Lung_Opacity.metadata.xlsx - to Lung_Opacity.metadata.csv
Lung_Opacity.metadata.xlsx has been removed
Converted Viral Pneumonia.metadata.xlsx - to Viral Pneumonia.metadata.csv
Viral Pneumonia.metadata.xlsx has been removed
Converted COVID.metadata.xlsx - to COVID.metadata.csv
COVID.metadata.xlsx has been removed
Converted Normal.metadata.xlsx - to Normal.metadata.csv
Normal.metadata.xlsx has been removed
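
The log above suggests that each `.xlsx` file is written out as a `.csv` file and then deleted. A minimal sketch of that behaviour, assuming `pandas` (with an Excel engine such as `openpyxl`) is installed and that only the first sheet of each workbook matters, could look like the following. The function name `excel_to_csv_sketch` and the `remove_source` parameter are hypothetical; the real `convert_excel_to_csv` may differ.

```python
from pathlib import Path


def excel_to_csv_sketch(folder: str, remove_source: bool = True) -> list:
    """Convert every .xlsx file directly inside folder to a sibling .csv file.

    Hypothetical stand-in for convert_excel_to_csv; reads the first sheet only.
    Returns the names of the CSV files that were written.
    """
    xlsx_paths = sorted(Path(folder).glob("*.xlsx"))
    if not xlsx_paths:
        return []

    # Deferred import: pandas is only needed when there is something to convert
    import pandas as pd

    converted = []
    for xlsx_path in xlsx_paths:
        # "COVID.metadata.xlsx" becomes "COVID.metadata.csv"
        csv_path = xlsx_path.with_suffix(".csv")
        pd.read_excel(xlsx_path).to_csv(csv_path, index=False)
        if remove_source:
            xlsx_path.unlink()
        converted.append(csv_path.name)
    return converted
```

Passing `remove_source=False` would keep the original workbooks alongside the CSV files, which may be preferable while the conversion is still being checked.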


### Simplify the folder structure for future use with *ImageDataGenerator*

1. Get the image categories and classes

In [7]:
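import os

# os.walk yields (dirpath, dirnames, filenames) top-down; keep only the dirnames
folders = [x[1] for x in os.walk(old_data_dir, topdown=True)]
classes = folders[0]      # subfolders of the dataset root, one per class
categories = folders[1]   # subfolders of the first class folder visited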

print(f"Found {len(classes)} classes: {classes}")
print(f"Found {len(categories)} categories: {categories}")

Found 4 classes: ['Viral Pneumonia', 'Lung_Opacity', 'Normal', 'COVID']
Found 2 categories: ['images', 'masks']
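
The cell above relies on the order in which `os.walk` visits directories: `folders[1]` holds the subfolders of whichever class folder happens to be visited first. A slightly more explicit alternative, offered here only as a sketch, lists the two levels directly with `pathlib`. The function name `list_classes_and_categories` is hypothetical, and the results are sorted, so the order will differ from the output shown above.

```python
from pathlib import Path


def list_classes_and_categories(root_dir: str):
    """Return (classes, categories) for a root/<class>/<category> layout."""
    root = Path(root_dir)
    # First level: one folder per class; plain files such as CSVs are ignored
    classes = sorted(p.name for p in root.iterdir() if p.is_dir())
    # Second level: union of the subfolder names found under every class
    categories = sorted(
        {sub.name for cls in classes for sub in (root / cls).iterdir() if sub.is_dir()}
    )
    return classes, categories
```

Taking the union across all classes also means a class folder with a missing `masks` or `images` subfolder does not silently change the category list.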


2. Create a new folder named *dataset* and create the category and class subfolders under it

In [8]:
create_class_folders(new_data_dir, categories, classes)

Creating directory '../data/raw/dataset/images/Viral Pneumonia'...
Directory '../data/raw/dataset/images/Viral Pneumonia' created successfully.
Creating directory '../data/raw/dataset/images/Lung_Opacity'...
Directory '../data/raw/dataset/images/Lung_Opacity' created successfully.
Creating directory '../data/raw/dataset/images/Normal'...
Directory '../data/raw/dataset/images/Normal' created successfully.
Creating directory '../data/raw/dataset/images/COVID'...
Directory '../data/raw/dataset/images/COVID' created successfully.
Creating directory '../data/raw/dataset/masks/Viral Pneumonia'...
Directory '../data/raw/dataset/masks/Viral Pneumonia' created successfully.
Creating directory '../data/raw/dataset/masks/Lung_Opacity'...
Directory '../data/raw/dataset/masks/Lung_Opacity' created successfully.
Creating directory '../data/raw/dataset/masks/Normal'...
Directory '../data/raw/dataset/masks/Normal' created successfully.
Creating directory '../data/raw/dataset/masks/COVID'...
Directory 
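
Judging by the paths in the log, the new layout is `dataset/<category>/<class>`. A minimal sketch of creating that tree, assuming nothing beyond the standard library, is shown below. The name `make_class_dirs` is hypothetical and is not the project's `create_class_folders`.

```python
import os


def make_class_dirs(base_dir, categories, classes):
    """Create base_dir/<category>/<class> for every combination.

    Returns the list of directory paths, in creation order.
    """
    created = []
    for category in categories:
        for class_name in classes:
            target = os.path.join(base_dir, category, class_name)
            # exist_ok=True makes the call safe to repeat
            os.makedirs(target, exist_ok=True)
            created.append(target)
    return created
```

Putting the category first means each of `dataset/images` and `dataset/masks` contains exactly one subfolder per class, which is the layout that directory-based image loaders typically expect.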

3. Move the images and masks to the newly created folders

In [9]:
move_files_to_class_folders(old_data_dir, new_data_dir, categories, classes)

Files copied successfully.
Number of files: 1345 from '../data/raw/COVID-19_Radiography_Dataset/Viral Pneumonia/images' to '../data/raw/dataset/images/Viral Pneumonia'...
Files copied successfully.
Number of files: 6012 from '../data/raw/COVID-19_Radiography_Dataset/Lung_Opacity/images' to '../data/raw/dataset/images/Lung_Opacity'...
Files copied successfully.
Number of files: 10192 from '../data/raw/COVID-19_Radiography_Dataset/Normal/images' to '../data/raw/dataset/images/Normal'...
Files copied successfully.
Number of files: 3616 from '../data/raw/COVID-19_Radiography_Dataset/COVID/images' to '../data/raw/dataset/images/COVID'...
Files copied successfully.
Number of files: 1345 from '../data/raw/COVID-19_Radiography_Dataset/Viral Pneumonia/masks' to '../data/raw/dataset/masks/Viral Pneumonia'...
Files copied successfully.
Number of files: 6012 from '../data/raw/COVID-19_Radiography_Dataset/Lung_Opacity/masks' to '../data/raw/dataset/masks/Lung_Opacity'...
Files copied successfully.
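
The log shows files going from `<old>/<class>/<category>` to `<new>/<category>/<class>`. The sketch below illustrates that remapping with the standard library. It is an assumption about the behaviour, not the project's `move_files_to_class_folders`; the name `relocate_files` and the `copy` flag are hypothetical.

```python
import shutil
from pathlib import Path


def relocate_files(old_dir, new_dir, categories, classes, copy=True):
    """Transfer files from old/<class>/<category> to new/<category>/<class>.

    Returns a dict mapping (category, class) to the number of files handled.
    """
    counts = {}
    for category in categories:
        for class_name in classes:
            src = Path(old_dir) / class_name / category
            dst = Path(new_dir) / category / class_name
            dst.mkdir(parents=True, exist_ok=True)
            # Snapshot the listing first so moving files does not disturb iteration
            files = [p for p in src.iterdir() if p.is_file()]
            for f in files:
                if copy:
                    shutil.copy2(f, dst / f.name)  # keeps the original in place
                else:
                    shutil.move(str(f), str(dst / f.name))
            counts[(category, class_name)] = len(files)
    return counts
```

Comparing the returned counts for `images` and `masks` within each class is a quick way to confirm that every image has a matching mask, as the per-class numbers in the log above suggest.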
