<a href="https://colab.research.google.com/github/tincorpai/Deep_Learning_Pytorch/blob/master/PyTorch_Custom_Datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What is a custom dataset?

We've used some datasets with PyTorch before. But how do you get your own data into PyTorch?

One way to do so is with custom datasets.

Depending on what you're working on (vision, text, audio, recommendation), you'll want to look into the corresponding PyTorch domain library for existing data loading functions and customizable data loading functions.

**Resources:** https://www.learnpytorch.io/04_pytorch_custom_datasets/

**Dataset creation code:** https://github.com/mrdbourke/pytorch-deep-learning/blob/main/extras/04_custom_data_creation.ipynb

## PyTorch Domain Libraries 

| Problem Space         | Pre-built Datasets and Functions |
|-----------------------|----------------------------------|
| Vision                | torchvision.datasets             |
| Text                  | torchtext.datasets               |
| Audio                 | torchaudio.datasets              |
| Recommendation system | torchrec.datasets                |
| Bonus                 | TorchData*                       |





## Working with Custom Datasets

*  Getting a custom dataset with PyTorch

*  Becoming one with the data (preparing and visualizing)

*  Transforming data for use with a model

*  Loading custom data with pre-built functions and custom functions

*  Building FoodVision Mini to classify images

*  Comparing models with and without data augmentation

*  Making predictions on custom data (data not within our training or testing dataset)

## 0. Importing PyTorch and setting up device-agnostic code

In [1]:
import torch
from torch import nn

# Note: PyTorch 1.10.0+ is required for this course (Check the version of PyTorch)
torch.__version__

'1.13.1+cu116'

In [2]:
# Set up device-agnostic code
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cpu'
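
To show why the `device` string matters, here is a minimal sketch of how it is typically used afterwards. The toy `nn.Linear` model and the random tensor are stand-ins chosen only for illustration; they are not part of this notebook's FoodVision Mini model.

```python
import torch
from torch import nn

# Same device-agnostic setup as above
device = "cuda" if torch.cuda.is_available() else "cpu"

# Create a tensor directly on the chosen device
x = torch.rand(2, 3, device=device)

# Move a toy model's parameters to the same device (illustrative model only)
model = nn.Linear(in_features=3, out_features=1).to(device)

# Input and model live on the same device, so the forward pass works
y = model(x)
print(y.shape, y.device)
```

If the tensor and the model sat on different devices, the forward pass would raise a runtime error, which is the problem device-agnostic code is meant to avoid.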

## 1. Get Dataset

Our dataset is a subset of the Food101 dataset.

Food101 starts with 101 different classes of food and 1000 images per class (750 training, 250 testing).

Our dataset starts with 3 classes of food and only 10% of the images (~75 training and ~25 testing images per class).

Why do this?


When starting out on ML projects, it's important to try things on a small scale and then increase the scale when necessary.

The whole point is to speed up how fast you can experiment.
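
As a rough illustration of the 10% arithmetic, the sketch below samples a reproducible subset from a list of file names. The helper `sample_subset` and the file names are hypothetical; the dataset creation notebook linked above is the authoritative source for how the real subset was built, and its per-class counts differ slightly from exactly 75 and 25.

```python
import random

def sample_subset(items, fraction=0.1, seed=42):
    """Return a reproducible random sample holding `fraction` of `items`."""
    rng = random.Random(seed)           # local RNG so the global state is untouched
    k = round(len(items) * fraction)    # e.g. 10% of 750 -> 75
    return rng.sample(items, k)

# Hypothetical file names standing in for a single Food101 class
train_files = [f"pizza/train_{i}.jpg" for i in range(750)]
test_files = [f"pizza/test_{i}.jpg" for i in range(250)]

train_subset = sample_subset(train_files)
test_subset = sample_subset(test_files)
print(len(train_subset), len(test_subset))  # 75 25
```

With 3 classes this gives about 225 training and 75 testing images in total, small enough to iterate on quickly.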


In [4]:
import requests
import zipfile
from pathlib import Path

# Set up path to a data folder
data_path = Path("data/")
image_path = data_path / "pizza_steak_sushi"

# If the image folder doesn't exist, create it, then download and unzip the data
if image_path.is_dir():
  print(f"{image_path} directory already exists... skipping download")
else:
  print(f"{image_path} does not exist, creating one...")
  image_path.mkdir(parents=True, exist_ok=True)

  # Download pizza, steak and sushi data
  with open(data_path / "pizza_steak_sushi.zip", "wb") as f:
    request = requests.get("https://github.com/mrdbourke/pytorch-deep-learning/raw/main/data/pizza_steak_sushi.zip")
    request.raise_for_status()  # stop early if the download failed
    print("Downloading pizza, steak, sushi data...")
    f.write(request.content)

  # Unzip pizza, steak, sushi data
  with zipfile.ZipFile(data_path / "pizza_steak_sushi.zip", "r") as zip_ref:
    print("Unzipping pizza, steak, sushi data...")
    zip_ref.extractall(image_path)

data/pizza_steak_sushi directory already exists... skipping download


## 2. Understanding the dataset (data preparation and data exploration)

In [7]:
import os

def walk_through_dir(dir_path):
  """Walks through dir_path, printing the number of directories and files in each folder."""
  for dirpath, dirnames, filenames in os.walk(dir_path):
    print(f"There are {len(dirnames)} directories and {len(filenames)} images in '{dirpath}'.")


In [8]:
walk_through_dir(image_path)

There are 2 directories and 0 images in 'data/pizza_steak_sushi'.
There are 3 directories and 0 images in 'data/pizza_steak_sushi/train'.
There are 0 directories and 75 images in 'data/pizza_steak_sushi/train/steak'.
There are 0 directories and 78 images in 'data/pizza_steak_sushi/train/pizza'.
There are 0 directories and 72 images in 'data/pizza_steak_sushi/train/sushi'.
There are 3 directories and 0 images in 'data/pizza_steak_sushi/test'.
There are 0 directories and 19 images in 'data/pizza_steak_sushi/test/steak'.
There are 0 directories and 25 images in 'data/pizza_steak_sushi/test/pizza'.
There are 0 directories and 31 images in 'data/pizza_steak_sushi/test/sushi'.
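
The printed summary is handy for reading, but a dictionary of counts is easier to check programmatically. The sketch below uses a hypothetical helper, `count_images_per_class`, and assumes the images use the `.jpg` extension. So that it runs without the dataset, the demo builds a small throwaway directory tree with made-up file names.

```python
import tempfile
from pathlib import Path

def count_images_per_class(split_dir, pattern="*.jpg"):
    """Map each class folder name inside split_dir to its number of matching files."""
    split_dir = Path(split_dir)
    return {
        class_dir.name: len(list(class_dir.glob(pattern)))
        for class_dir in sorted(split_dir.iterdir())
        if class_dir.is_dir()
    }

# Demo on a temporary tree that mimics the train/<class>/<image> layout
with tempfile.TemporaryDirectory() as tmp:
    train_dir = Path(tmp) / "train"
    for class_name, n_images in {"pizza": 3, "steak": 2, "sushi": 1}.items():
        class_dir = train_dir / class_name
        class_dir.mkdir(parents=True)
        for i in range(n_images):
            (class_dir / f"{i}.jpg").touch()  # empty placeholder files
    demo_counts = count_images_per_class(train_dir)

print(demo_counts)  # {'pizza': 3, 'steak': 2, 'sushi': 1}
```

On the real data, `count_images_per_class(image_path / "train")` should agree with the output above: 78 pizza, 75 steak and 72 sushi images, 225 in total, while the test split holds 25 + 19 + 31 = 75.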


### Standard Image Classification Data Format

