## Get Data

In [None]:
from fastbook import *
from fastai.vision.widgets import *

In [None]:
bear_types = 'grizzly','black','teddy'
path = Path('bears')

In [None]:
if not path.exists():
    path.mkdir()
    for bt in bear_types:
        dest = (path/bt)
        dest.mkdir(exist_ok=True)
        urls = search_images_ddg(f'{bt} bear')
        download_images(dest, urls=urls)

In [None]:
fns = get_image_files(path)
fns

In [None]:
failed = verify_images(fns)
failed

In [None]:
failed.map(Path.unlink)

In [None]:
??verify_images

## From data to dataloader

To train a model, we'll need a `DataLoaders` object. It holds the training and validation `DataLoader`s, each of which is an iterator that provides a stream of mini-batches, where each mini-batch is a tuple of a batch of independent variables and a batch of dependent variables.
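
The stream of mini-batches described above can be pictured with a small pure-Python sketch. This is not how fastai implements `DataLoader`; the function name `iter_minibatches` and the toy data are hypothetical and only illustrate the shape of what comes out: one `(inputs, targets)` tuple per step.

```python
from typing import Iterator, List, Sequence, Tuple

def iter_minibatches(xs: Sequence, ys: Sequence, bs: int = 64) -> Iterator[Tuple[List, List]]:
    """Yield (inputs, targets) tuples holding at most `bs` items each."""
    assert len(xs) == len(ys), "every input needs exactly one label"
    for start in range(0, len(xs), bs):
        yield list(xs[start:start + bs]), list(ys[start:start + bs])

# Hypothetical toy data: file names as inputs, bear types as labels.
xs = [f"img_{i}.jpg" for i in range(10)]
ys = ["grizzly", "black", "teddy", "grizzly", "black",
      "teddy", "grizzly", "black", "teddy", "grizzly"]

batches = list(iter_minibatches(xs, ys, bs=4))
for xb, yb in batches:
    # Each step hands back one batch of inputs and one batch of labels.
    print(len(xb), len(yb))
```

A real `DataLoader` also shuffles, applies transforms, and stacks the items into tensors; the sketch leaves all of that out.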

Building a `DataBlock` involves several steps, which can be framed as questions:
1. What are the types of your inputs and labels? `blocks`
2. Where is your data? `get_items`
3. Does something need to be applied to the inputs or labels? `get_x`, `get_y`
4. How should the data be split? `splitter`
5. Does something need to be applied to each formed item? `item_tfms`
6. Does something need to be applied to each formed batch? `batch_tfms`
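
Questions 2 to 4 can be sketched in plain Python to show what each piece is responsible for. This is an illustration under stated assumptions, not fastai's code: the `*_sketch` functions and the toy paths are hypothetical stand-ins for `get_image_files`, `parent_label`, and `RandomSplitter`, and the shuffle here uses Python's `random` module, so the actual indices will differ from fastai's.

```python
import random
from pathlib import Path

def get_items_sketch(paths):
    """Question 2 (get_items): keep only paths that look like images."""
    image_exts = {".jpg", ".jpeg", ".png"}
    return [Path(p) for p in paths if Path(p).suffix.lower() in image_exts]

def parent_label_sketch(path):
    """Question 3 (get_y): use the name of the parent folder as the label."""
    return Path(path).parent.name

def random_split_sketch(items, valid_pct=0.2, seed=42):
    """Question 4 (splitter): return (train_indices, valid_indices)."""
    idxs = list(range(len(items)))
    random.Random(seed).shuffle(idxs)  # a fixed seed makes the split repeatable
    cut = int(valid_pct * len(items))
    return idxs[cut:], idxs[:cut]

# Hypothetical folder layout: bears/<bear type>/<file name>
paths = [f"bears/{bt}/{i:03d}.jpg"
         for bt in ("grizzly", "black", "teddy") for i in range(5)]
paths.append("bears/grizzly/notes.txt")  # not an image, so it gets filtered out

items = get_items_sketch(paths)
labels = [parent_label_sketch(p) for p in items]
train_idx, valid_idx = random_split_sketch(items)
print(len(items), len(train_idx), len(valid_idx))
```

Fixing the seed matters for the same reason it does in `RandomSplitter(valid_pct=0.2, seed=42)`: the validation set stays the same between runs, so results remain comparable.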

In [None]:
# DataBlock: generic container for quickly building Datasets and DataLoaders.
#            blocks:       One or more TransformBlocks describing the types of the
#                          inputs and labels, e.g. ImageBlock, CategoryBlock,
#                          MultiCategoryBlock, TextBlock.
#                          CategoryBlock: TransformBlock for single-label categorical targets.
#            get_items:    Where is the data?
#                          get_image_files collects the file locations of all our images.
#            get_y:        How to extract the label from each item.
#            splitter:     How to split the data. This is usually a random split between the
#                          training and validation sets.
#            item_tfms:    Transforms applied to each item individually. This is done on the CPU.

bears = DataBlock(
    blocks=(ImageBlock, CategoryBlock), 
    get_items=get_image_files, 
    splitter=RandomSplitter(valid_pct=0.2, seed=42),
    get_y=parent_label,
    item_tfms=Resize(128))

The command above gives us a `DataBlock` object. This is like a template for creating a `DataLoaders`. We still need to tell fastai the actual source of our data, in this case the path where the images can be found:

In [None]:
dls = bears.dataloaders(path)

A `DataLoaders` includes **validation** and **training** `DataLoader`s. `DataLoader` is a class that provides batches of a few items at a time to the GPU. We'll be learning a lot more about this class in the next chapter. When you loop through a `DataLoader`, fastai will give you 64 (by default) items at a time, all stacked up into a single tensor. We can take a look at a few of those items by calling the `show_batch` method on a `DataLoader`:

In [None]:
dls.valid.show_batch(max_n=4, nrows=1)

In [None]:
bears = bears.new(item_tfms=Resize(128, ResizeMethod.Squish))
dls = bears.dataloaders(path)
dls.valid.show_batch(max_n=4, nrows=1)

In [None]:
# In practice, the following combination is used: a random resized crop per item, plus batch augmentations
bears = bears.new(
    item_tfms=RandomResizedCrop(224, min_scale=0.5),
    batch_tfms=aug_transforms()
)
dls = bears.dataloaders(path)
dls.train.show_batch(max_n=4, nrows=1, unique=True)

## Training Your Model, and Using It to Clean Your Data

In [None]:
learn = vision_learner(dls, resnet18, metrics=error_rate)
learn.fine_tune(10)

Now let's see whether the mistakes the model is making are mainly thinking that grizzlies are teddies (that would be bad for safety!), or that grizzlies are black bears, or something else. To visualize this, we can create a *confusion matrix*:

In [None]:
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()
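
To make the idea of a confusion matrix concrete, here is a minimal pure-Python sketch. It does not use fastai; the function `confusion_matrix_sketch` and the toy labels are hypothetical. Rows are the actual classes and columns are the predicted classes, so correct predictions fall on the diagonal.

```python
from collections import Counter

def confusion_matrix_sketch(actual, predicted, classes):
    """Count (actual, predicted) pairs; rows = actual, columns = predicted."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]

classes = ["black", "grizzly", "teddy"]

# Hypothetical validation results for six images.
actual    = ["grizzly", "grizzly", "black", "teddy", "black", "grizzly"]
predicted = ["grizzly", "black",   "black", "teddy", "black", "teddy"]

matrix = confusion_matrix_sketch(actual, predicted, classes)
for name, row in zip(classes, matrix):
    print(f"{name:8s}", row)
```

In this toy data the off-diagonal entries sit in the `grizzly` row: one grizzly was predicted as a black bear and one as a teddy, which is exactly the kind of mistake the matrix is meant to surface.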

It's helpful to see where exactly our errors are occurring, to see whether they're due to a dataset problem (e.g., images that aren't bears at all, or are labeled incorrectly, etc.), or a model problem (perhaps it isn't handling images taken with unusual lighting, or from a different angle, etc.). To do this, we can sort our images by their *loss*.

The loss is a number that is higher if the model is incorrect (especially if it's also confident of its incorrect answer), or if it's correct, but not confident of its correct answer.
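
The behaviour described above can be checked with a small worked example. The sketch assumes the loss is cross-entropy, which for a single image reduces to the negative log of the probability the model assigned to the true class; the function name and the probabilities below are hypothetical.

```python
import math

def cross_entropy_sketch(prob_of_true_class: float) -> float:
    """Per-item cross-entropy: -log(probability given to the correct class)."""
    return -math.log(prob_of_true_class)

# Three hypothetical predictions for an image whose true label is "grizzly".
confident_correct = cross_entropy_sketch(0.99)  # right, and sure of it
unsure_correct    = cross_entropy_sketch(0.40)  # right (e.g. 0.40 / 0.30 / 0.30), but hesitant
confident_wrong   = cross_entropy_sketch(0.01)  # wrong, and sure of the wrong answer

print(f"{confident_correct:.3f} {unsure_correct:.3f} {confident_wrong:.3f}")
```

The loss grows as the probability on the true class shrinks, so sorting by loss puts confidently wrong predictions first, followed by correct but hesitant ones.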

In [None]:
interp.plot_top_losses(5, nrows=2)

fastai includes a handy GUI for data cleaning called `ImageClassifierCleaner` that allows you to choose a category and the training versus validation set and view the highest-loss images (in order), along with menus to allow images to be selected for removal or relabeling:

In [None]:
#hide_output
cleaner = ImageClassifierCleaner(learn)
cleaner

## Turning Your Model into an Online Application