# Carbon footprint of storing all those datasets on Kaggle

The storage footprint can be estimated from the following factors: the energy consumption of an average 5 TB HDD/SSD per hour, aggregated over, say, a year; the proportion of that capacity used by the datasets; a redundancy factor of 2 or 3; the average PUE (power usage effectiveness) of data centers; and the average carbon intensity of the locations where the data is stored, if known. Multiplying these together gives the amount of CO2-equivalent emitted. This estimate excludes the CPU power needed to access the storage. A more detailed estimate would also consider networking and embodied emissions, as in the Cloud Carbon Footprint methodology: https://www.cloudcarbonfootprint.org/docs/methodology/#storage
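
Written as a formula, the multiplication described above might look like the following. The symbols are introduced here for illustration only and do not come from the linked methodology: $P_{\text{drive}}$ is the average power draw of one drive in kW, $C_{\text{drive}}$ its capacity (5 TB in the description above), $S$ the total size of the datasets in TB, $T$ the period in hours (8760 for one year), $R$ the redundancy factor, and $I$ the carbon intensity of the grid in kg CO2e per kWh.

$$
\mathrm{CO_2e} = P_{\text{drive}} \cdot \frac{S}{C_{\text{drive}}} \cdot T \cdot R \cdot \mathrm{PUE} \cdot I
$$

The fraction $S / C_{\text{drive}}$ is the proportion of drive capacity used by the datasets, so the first three factors give the energy in kWh before redundancy and data-center overhead are applied.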

In [5]:
import pandas as pd
import numpy as np

In [6]:
df = pd.read_csv("../data/01_isic_datasets_metadata.csv")
df

Unnamed: 0,@context.@language,@context.@vocab,@type,name,alternateName,description,url,identifier,creator.@type,creator.name,...,distribution_0.@type,contentUrl,contentSize,encodingFormat,isPrivate,downloadCount,viewCount,voteCount,usabilityRating,conformsTo
0,en,https://schema.org/,Dataset,Skin Cancer ISIC,The skin cancer data. Contains 9 classes of sk...,,https://www.kaggle.com/nodoubttome/skin-cancer...,319080,Person,Andrey Katanskiy,...,DataDownload,https://www.kaggle.com/datasets/nodoubttome/sk...,2048.0,application/zip,False,16375,132171,220,0.750000,http://mlcommons.org/croissant/1.0
1,en,https://schema.org/,Dataset,All ISIC Data 20240629,All images and metadata in ISIC archive.,,https://www.kaggle.com/tomooinubushi/all-isic-...,5302785,Person,tomoo inubushi,...,DataDownload,https://www.kaggle.com/datasets/tomooinubushi/...,75776.0,application/zip,False,376,3489,55,0.764706,http://mlcommons.org/croissant/1.0
2,en,https://schema.org/,Dataset,ISIC 2020 JPG 256x256 RESIZED,,,https://www.kaggle.com/nischaydnk/isic-2020-jp...,5295545,Person,Nischay Dhankhar,...,DataDownload,https://www.kaggle.com/datasets/nischaydnk/isi...,595.0,application/zip,False,709,2149,48,0.882353,http://mlcommons.org/croissant/1.0
3,en,https://schema.org/,Dataset,ISIC 2019 JPG 224x224 RESIZED,ISIC 2019 resized dataset,,https://www.kaggle.com/nischaydnk/isic-2019-jp...,5295517,Person,Nischay Dhankhar,...,DataDownload,https://www.kaggle.com/datasets/nischaydnk/isi...,355.0,application/zip,False,561,1930,40,0.941176,http://mlcommons.org/croissant/1.0
4,en,https://schema.org/,Dataset,JPEG ISIC 2019 512x512,,,https://www.kaggle.com/cdeotte/jpeg-isic2019-5...,762203,Person,Chris Deotte,...,DataDownload,https://www.kaggle.com/datasets/cdeotte/jpeg-i...,1024.0,application/zip,False,2445,7096,54,0.588235,http://mlcommons.org/croissant/1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
855,en,https://schema.org/,Dataset,melanoma_isic,,,https://www.kaggle.com/chitrapsg/melanoma-isic,3501446,Person,Chitra Govindasamy,...,DataDownload,https://www.kaggle.com/datasets/chitrapsg/mela...,786.0,application/zip,False,3,81,0,0.000000,http://mlcommons.org/croissant/1.0
856,en,https://schema.org/,Dataset,4000ISIC19Balanced,,,https://www.kaggle.com/manirujjamanmonir/4000i...,3374258,Person,Manirujjaman Monir,...,DataDownload,https://www.kaggle.com/datasets/manirujjamanmo...,6144.0,application/zip,False,2,73,0,0.000000,http://mlcommons.org/croissant/1.0
857,en,https://schema.org/,Dataset,siim_isic_2020_leukemia_dataset,,,https://www.kaggle.com/rajibbag1/siim-isic-202...,4911856,Person,RAJIB BAG_1,...,DataDownload,https://www.kaggle.com/datasets/rajibbag1/siim...,2048.0,application/zip,False,0,20,0,0.000000,http://mlcommons.org/croissant/1.0
858,en,https://schema.org/,Dataset,data_isic1718,,,https://www.kaggle.com/bugakakak/data-isic1718,4788617,Person,bugakakak,...,DataDownload,https://www.kaggle.com/datasets/bugakakak/data...,261.0,application/zip,False,0,17,0,0.000000,http://mlcommons.org/croissant/1.0


In [7]:
# Drop rows whose size is recorded as "Unknown", then convert the column to float
df = df.drop(df[df.contentSize == "Unknown"].index)
df['contentSize'] = df['contentSize'].astype(float)

In [8]:
# contentSize is treated as megabytes here; conversions use binary units (1 GB = 1024 MB)
total_size_mb = df["contentSize"].sum()

total_size_gb = total_size_mb / 1024
total_size_tb = total_size_gb / 1024

print(f'Total size of datasets: {total_size_gb:.2f} GB')
print(f'Total size of datasets: {total_size_tb:.2f} TB')

Total size of datasets: 2642.14 GB
Total size of datasets: 2.58 TB
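
As a rough illustration, the factors listed at the top can be applied to the 2.58 TB total. This is only a sketch: the function name `storage_footprint_kg_co2e` is hypothetical, and every default value is an assumption. The 0.65 Wh per TB-hour figure is, as far as I recall, roughly the HDD coefficient given in the Cloud Carbon Footprint methodology and should be checked against that source; the PUE of 1.1 and the grid intensity of 400 g CO2e per kWh are placeholders, since the actual storage locations are not known here.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def storage_footprint_kg_co2e(
    size_tb,
    watt_hours_per_tb_hour=0.65,  # assumed HDD energy coefficient (Wh per TB-hour)
    hours=HOURS_PER_YEAR,         # period over which the data is stored
    replication_factor=3,         # redundancy factor (2 or 3 in the text above)
    pue=1.1,                      # assumed data-center power usage effectiveness
    grid_g_co2e_per_kwh=400.0,    # assumed placeholder carbon intensity
):
    """Return an estimate of storage emissions in kg CO2e for the given period."""
    # Energy drawn by the drives themselves, converted from Wh to kWh
    energy_kwh = size_tb * watt_hours_per_tb_hour * hours / 1000
    # Account for replicated copies and data-center overhead
    energy_kwh *= replication_factor * pue
    # Convert energy to emissions, from grams to kilograms
    return energy_kwh * grid_g_co2e_per_kwh / 1000


estimate = storage_footprint_kg_co2e(2.58)
print(f"Estimated storage footprint: {estimate:.2f} kg CO2e per year")
```

Because the result is a plain product, it scales linearly with each input: doubling the assumed carbon intensity or the replication factor doubles the estimate. As noted at the top, this leaves out CPU, networking, and embodied emissions.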
