# Downloading a subset of the DeepLesion dataset
### CS5330
### by Talal Siddiqui and Zeyu Wang
### December 10th, 2024

This notebook downloads data from the DeepLesion dataset's repository. We only download a subset of the data because Google Colab cannot store all 220 GB of CT scans.

The dataset is located at https://nihcc.app.box.com/v/DeepLesion/folder/50715173939

This subset is meant for the YOLOv5n model.

In [None]:
import urllib.request
import os

data_download_dir = "raw_data"

# Create the download directory if it does not already exist
os.makedirs(data_download_dir, exist_ok=True)

# Small subset of links from the dataset
data_urls = ['https://nihcc.box.com/shared/static/sp5y2k799v4x1x77f7w1aqp26uyfq7qz.zip',
    'https://nihcc.box.com/shared/static/l9e1ys5e48qq8s409ua3uv6uwuko0y5c.zip',
    'https://nihcc.box.com/shared/static/48jotosvbrw0rlke4u88tzadmabcp72r.zip',
    'https://nihcc.box.com/shared/static/xa3rjr6nzej6yfgzj9z6hf97ljpq1wkm.zip',
    'https://nihcc.box.com/shared/static/58ix4lxaadjxvjzq4am5ehpzhdvzl7os.zip',
    'https://nihcc.box.com/shared/static/cfouy1al16n0linxqt504n3macomhdj8.zip',
    'https://nihcc.box.com/shared/static/z84jjstqfrhhlr7jikwsvcdutl7jnk78.zip',
    'https://nihcc.box.com/shared/static/6viu9bqirhjjz34xhd1nttcqurez8654.zip',
    'https://nihcc.box.com/shared/static/9ii2xb6z7869khz9xxrwcx1393a05610.zip',
    'https://nihcc.box.com/shared/static/2c7y53eees3a3vdls5preayjaf0mc3bn.zip']

md5_link = 'https://nihcc.box.com/shared/static/q0f8gy79q2spw96hs6o4jjjfsrg17t55.txt'
urllib.request.urlretrieve(md5_link, "MD5_checksums.txt")  # download the MD5 checksum file

for idx, link in enumerate(data_urls):
    fn = 'Images_png_%02d.zip' % (idx + 1)  # archives are numbered starting at 01
    full_fn = os.path.join(data_download_dir, fn)
    print('downloading', fn, '...')
    urllib.request.urlretrieve(link, full_fn)  # download the zip file
print("Download complete. Please check the MD5 checksums")

downloading Images_png_01.zip ...
downloading Images_png_02.zip ...
downloading Images_png_03.zip ...
downloading Images_png_04.zip ...
downloading Images_png_05.zip ...
downloading Images_png_06.zip ...
downloading Images_png_07.zip ...
downloading Images_png_08.zip ...
downloading Images_png_09.zip ...
downloading Images_png_10.zip ...
Download complete. Please check the MD5 checksums
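
The download cell ends by asking for an MD5 check but does not perform one. The following is a minimal sketch of how that check might look, not part of the original notebook. The helper names (`md5_of_file`, `parse_checksums`, `verify_downloads`) are hypothetical, and the layout of `MD5_checksums.txt` is an assumption: the parser only expects each relevant line to contain a 32-character hex digest and a `.zip` file name, and it skips anything else.

```python
import hashlib
import os
import re


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksums(text):
    """Map file name -> expected MD5 digest.

    Assumption: each relevant line holds one 32-character hex digest and one
    '.zip' file name, in either order. Lines that do not match are skipped.
    """
    expected = {}
    for line in text.splitlines():
        digest = re.search(r"\b[0-9a-fA-F]{32}\b", line)
        name = re.search(r"[\w.-]+\.zip", line)
        if digest and name:
            expected[name.group(0)] = digest.group(0).lower()
    return expected


def verify_downloads(download_dir, checksum_file):
    """Return {file name: True/False/None}; None means no digest was listed for the file."""
    with open(checksum_file) as f:
        expected = parse_checksums(f.read())
    results = {}
    for fn in sorted(os.listdir(download_dir)):
        if fn.endswith(".zip"):
            want = expected.get(fn)
            if want is None:
                results[fn] = None
            else:
                results[fn] = md5_of_file(os.path.join(download_dir, fn)) == want
    return results


# Only runs when the files from the download cell are present.
if os.path.isdir("raw_data") and os.path.isfile("MD5_checksums.txt"):
    labels = {True: "OK", False: "MISMATCH", None: "no checksum listed"}
    for fn, ok in verify_downloads("raw_data", "MD5_checksums.txt").items():
        print(fn, labels[ok])
```

Any file reported as `MISMATCH` is probably a truncated or corrupted download and is worth fetching again before uploading it to Google Drive.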


In [None]:
! ls raw_data


Images_png_01.zip  Images_png_03.zip  Images_png_05.zip  Images_png_07.zip  Images_png_09.zip
Images_png_02.zip  Images_png_04.zip  Images_png_06.zip  Images_png_08.zip  Images_png_10.zip


### Uploading the data onto Google Drive
The subset takes roughly 15 to 20 minutes, sometimes longer, to download. To avoid that wait every time we want to use the model, we upload the zip files to Google Drive.

Then whenever we need to use this data, we can simply mount Google Drive, unzip the data into the local runtime (takes around 9 minutes), and be on our merry way!
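
The unzip step described above might be sketched as follows, using the standard `zipfile` module. This is an illustration rather than the code used in the project: `extract_archives` is a hypothetical helper, `lesion_images` is a hypothetical output folder, and the Drive path is assumed to match the one used in the copy cell of this notebook. Google Drive must already be mounted for the final lines to do anything.

```python
import os
import zipfile


def extract_archives(zip_dir, out_dir):
    """Extract every .zip file in zip_dir into out_dir; return the archive names processed."""
    os.makedirs(out_dir, exist_ok=True)
    done = []
    for fn in sorted(os.listdir(zip_dir)):
        if fn.endswith(".zip"):
            with zipfile.ZipFile(os.path.join(zip_dir, fn)) as zf:
                zf.extractall(out_dir)
            done.append(fn)
    return done


# Assumed Drive location of the uploaded archives; skipped if Drive is not mounted.
gdrive_zip_dir = "gdrive/MyDrive/CS5330/lesion_data/images"
if os.path.isdir(gdrive_zip_dir):
    print("Extracted:", extract_archives(gdrive_zip_dir, "lesion_images"))
```

Extracting into the local runtime instead of reading images directly from the mounted Drive keeps later file access on local disk.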

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
! ls gdrive/MyDrive/CS5330/lesion_data

In [None]:
import os
import shutil

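local_data_dir = "raw_data"
gdrive_data_dir = "gdrive/MyDrive/CS5330/lesion_data/images"

# Make sure the destination folder exists; otherwise copy2 treats the path as a file name
os.makedirs(gdrive_data_dir, exist_ok=True)

for data_zip_name in os.listdir(local_data_dir):
    print("Copying", data_zip_name, "...")
    src_path = os.path.join(local_data_dir, data_zip_name)
    shutil.copy2(src_path, gdrive_data_dir)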

Copying Images_png_05.zip ...
Copying Images_png_07.zip ...
Copying Images_png_02.zip ...
Copying Images_png_06.zip ...
Copying Images_png_10.zip ...
Copying Images_png_01.zip ...
Copying Images_png_04.zip ...
Copying Images_png_08.zip ...
Copying Images_png_03.zip ...
Copying Images_png_09.zip ...
