# Download Dataset

**You will likely not use this notebook if you are creating your research project with your own new data.**

This notebook downloads a dataset from the internet. Our example uses the Task09_Spleen dataset, which is fetched below as `Task09_Spleen.tar`.
Your own dataset will more likely be uploaded from wherever you collected your data.

In [1]:
!pwd

/blue/prismap-ai-core/sasank.desaraju/projects/lightning-hydra-template/notebooks


Let's first set up where our data will be downloaded to.
We'll use the `rootutils` package to locate the project root and load environment variables from a ".env" file.
To do this, (1) make sure that an empty file named ".project-root" is in the root of the project, and (2) add the chosen data directory to a new ".env" file.
For the second step, copy the file ".env.example" to ".env" and set `DATA_DIR` in it to the path of your data directory.
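
The two setup steps could be sketched as follows. This is only an illustration, run in a temporary directory so it does not touch a real project. The `DATA_DIR` name matches the variable read later in this notebook, but the contents of ".env.example" and the path `/path/to/your/data/` are placeholders assumed for the example.

```python
import shutil
import tempfile
from pathlib import Path

# Stand-in for the project root; in practice this is the top-level project directory.
project_root = Path(tempfile.mkdtemp())

# (1) Create the empty marker file that identifies the project root.
(project_root / ".project-root").touch()

# Hypothetical ".env.example"; the real template file may contain other entries.
(project_root / ".env.example").write_text("DATA_DIR=\n")

# (2) Copy ".env.example" to ".env", then fill in the data directory (placeholder path).
env_file = project_root / ".env"
shutil.copy(project_root / ".env.example", env_file)
env_file.write_text(
    env_file.read_text().replace("DATA_DIR=", "DATA_DIR=/path/to/your/data/")
)

print(env_file.read_text())
```

In a real project you would normally do this once by hand, in a terminal or text editor, rather than from a notebook.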


In [6]:
import rootutils
import os
root = rootutils.setup_root(
    search_from=os.getcwd(),
    indicator=".project-root",
    dotenv=True,
    pythonpath=True,
    cwd=True,
)
# Print the root directory
print(root)
# Print the environment variable DATA_DIR
print(os.environ["DATA_DIR"])

current file is  /blue/prismap-ai-core/sasank.desaraju/projects/lightning-hydra-template
/blue/prismap-ai-core/sasank.desaraju/projects/lightning-hydra-template
/blue/prismap-ai-core/sasank.desaraju/projects/lightning-hydra-template/data/
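
To give a feel for what searching for an indicator file involves, here is a hand-rolled sketch. It is not how `rootutils` is implemented, and `find_project_root` is a hypothetical helper written for this example. It only shows the idea of walking up from a starting directory until a directory containing ".project-root" is found.

```python
import tempfile
from pathlib import Path


def find_project_root(start, indicator=".project-root"):
    """Return the nearest directory at or above `start` that contains `indicator`."""
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / indicator).exists():
            return candidate
    raise FileNotFoundError(f"No '{indicator}' found at or above {start}")


# Demonstrate on a throwaway tree: <tmp>/.project-root and <tmp>/notebooks/
tmp_root = Path(tempfile.mkdtemp()).resolve()
(tmp_root / ".project-root").touch()
notebooks_dir = tmp_root / "notebooks"
notebooks_dir.mkdir()

# Starting from the "notebooks" subdirectory, the search lands on its parent.
print(find_project_root(notebooks_dir))
```

This mirrors the situation above, where the notebook lives in a `notebooks` subdirectory and the reported root is its parent.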


We will now download our dataset to the data directory we specified in the ".env" file:

In [8]:
import monai
import os

# download the data if it's not already downloaded
resource = "https://msd-for-monai.s3-us-west-2.amazonaws.com/Task09_Spleen.tar"
md5 = "410d4a301da4e5b2f6f86ec3ddba524e"
data_dir = os.environ["DATA_DIR"]
extract_dir = os.path.join(data_dir, "Task09_Spleen")

compressed_file = os.path.join(data_dir, "Task09_Spleen.tar")
if not os.path.exists(extract_dir):
    # print the directory it will be downloaded to
    print(f"Data will be downloaded to {extract_dir}")
    monai.apps.download_and_extract(resource, compressed_file, data_dir, md5)

Data will be downloaded to /blue/prismap-ai-core/sasank.desaraju/projects/lightning-hydra-template/data/Task09_Spleen


Task09_Spleen.tar: 1.50GB [01:28, 18.2MB/s]                               


2024-07-11 13:09:33,707 - INFO - Downloaded: /blue/prismap-ai-core/sasank.desaraju/projects/lightning-hydra-template/data/Task09_Spleen.tar
2024-07-11 13:09:40,167 - INFO - Verified 'Task09_Spleen.tar', md5: 410d4a301da4e5b2f6f86ec3ddba524e.
2024-07-11 13:09:40,169 - INFO - Writing into directory: /blue/prismap-ai-core/sasank.desaraju/projects/lightning-hydra-template/data/.
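
The log above reports that the archive's MD5 checksum was verified. As a rough sketch of what such a check involves (using only the standard library, not MONAI's own code), the hypothetical helper `md5_of_file` below hashes a file in chunks so that a large archive never has to fit in memory. It is demonstrated on a tiny temporary file rather than the real download.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstrate on a small temporary file instead of the 1.5 GB archive.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    tmp_path = tmp.name

checksum = md5_of_file(tmp_path)
os.remove(tmp_path)
print(checksum)
```

To check a real download by hand, you could compare `md5_of_file(compressed_file)` against the `md5` string defined in the cell above.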


Now, let's create a master CSV file that contains the filenames of all of the images and labels in our dataset.
We'll talk more about why this is important in the [next notebook](1.0-create_splits.ipynb).

We'll use only the images and labels from the training set (`imagesTr` and `labelsTr`, respectively), as this dataset does not have annotated labels for its test set (`imagesTs`).
Let's put the resulting CSV file, called "spleen.csv", in the `splits` directory.

This whole notebook is specific to this dataset and may not generalize to other datasets.
You may already have a CSV file that contains information about your dataset.
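
If you do start from an existing CSV, a quick sanity check might look like the sketch below. It assumes the same `image,label` layout used in this notebook. `missing_files` is a hypothetical helper, and the `demo_*` files are throwaway stand-ins created only for the demonstration.

```python
import csv
import os
import tempfile


def missing_files(csv_path, image_dir, label_dir):
    """Return the paths listed in an `image,label` CSV that do not exist on disk."""
    missing = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            for folder, name in ((image_dir, row["image"]), (label_dir, row["label"])):
                path = os.path.join(folder, name)
                if not os.path.exists(path):
                    missing.append(path)
    return missing


# Throwaway demo data: one image file exists, but its label file does not.
demo_dir = tempfile.mkdtemp()
demo_images = os.path.join(demo_dir, "imagesTr")
demo_labels = os.path.join(demo_dir, "labelsTr")
os.makedirs(demo_images)
os.makedirs(demo_labels)
open(os.path.join(demo_images, "scan_1.nii.gz"), "w").close()

demo_csv = os.path.join(demo_dir, "demo.csv")
with open(demo_csv, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["image", "label"])
    writer.writerow(["scan_1.nii.gz", "scan_1.nii.gz"])

# Only the missing label path is reported.
print(missing_files(demo_csv, demo_images, demo_labels))
```

An empty list means every file named in the CSV was found.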

In [19]:
# Create a CSV that pairs each image with its label
# The header row should be "image,label"
# The images should be just the filenames relative to imagesTr
# The labels should be just the filenames relative to labelsTr
# Let's save the CSV file to the `splits` directory
import csv
import glob

image_dir = os.path.join(data_dir, "Task09_Spleen/imagesTr")
label_dir = os.path.join(data_dir, "Task09_Spleen/labelsTr")

# make a list of the files in both directories while excluding the hidden metadata files that start with "._"
image_files = sorted([os.path.basename(f) for f in glob.glob(os.path.join(image_dir, "*.nii.gz")) if not os.path.basename(f).startswith("._")])
label_files = sorted([os.path.basename(f) for f in glob.glob(os.path.join(label_dir, "*.nii.gz")) if not os.path.basename(f).startswith("._")])
print(image_files)

# confirm that the files are the same
assert image_files == label_files

# create a list of (image, label) filename tuples
data = list(zip(image_files, label_files))

# write the data to a CSV file with the header "image,label"
csv_file = os.path.join(root, "splits/spleen.csv")
os.makedirs(os.path.dirname(csv_file), exist_ok=True)  # create `splits` if it does not exist yet
with open(csv_file, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["image", "label"])
    writer.writerows(data)

print(f"CSV file saved to {csv_file}")

/blue/prismap-ai-core/sasank.desaraju/projects/lightning-hydra-template/data/
/blue/prismap-ai-core/sasank.desaraju/projects/lightning-hydra-template/data/Task09_Spleen/imagesTr
['imagesTs', '._dataset.json', 'labelsTr', 'dataset.json', '._labelsTr', '._imagesTs', '._imagesTr', 'imagesTr']
['spleen_38.nii.gz', '._spleen_20.nii.gz', 'spleen_32.nii.gz', 'spleen_16.nii.gz', 'spleen_59.nii.gz', '._spleen_28.nii.gz', '._spleen_60.nii.gz', 'spleen_22.nii.gz', 'spleen_24.nii.gz', '._spleen_31.nii.gz', '._spleen_25.nii.gz', 'spleen_41.nii.gz', 'spleen_2.nii.gz', '._spleen_26.nii.gz', 'spleen_56.nii.gz', '._spleen_3.nii.gz', 'spleen_47.nii.gz', '._spleen_46.nii.gz', '._spleen_62.nii.gz', 'spleen_14.nii.gz', '._spleen_45.nii.gz', 'spleen_10.nii.gz', 'spleen_13.nii.gz', '._spleen_52.nii.gz', 'spleen_25.nii.gz', '._spleen_16.nii.gz', '._spleen_19.nii.gz', 'spleen_33.nii.gz', '._spleen_13.nii.gz', 'spleen_6.nii.gz', 'spleen_49.nii.gz', '._spleen_2.nii.gz', 'spleen_60.nii.gz', 'spleen_27.nii.gz', '.