## Introduction

I want to build a bird detector that works fairly reliably on any sort of input image, not only on images like those it was trained on.

A multi-label classifier for birds, cats, and dogs should be a good starting point.

In [None]:
%load_ext autoreload
%autoreload 2

## install requirements

* maybe I should use a fresh venv copied from ai base for each project?

In [None]:
!pip install -qq -U fastai ipywidgets pillow pillow-avif-plugin inflect
!pip install -qq -U selenium webdriver_manager retry
!pip install -qq -U duckduckgo_search
!pip install -qq -U clip-retrieval
!pip install protobuf==3.20.0

In [None]:
!jupyter nbextension enable --py --sys-prefix widgetsnbextension

Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK


## import required libs

In [None]:
import pillow_avif
from fastai.vision.all import *
from send2trash import send2trash

In [None]:
import sys
from duckduckgo_search import ddg_images
from clip_retrieval.clip_client import ClipClient

In [None]:
sys.path.append('.')

In [None]:
# this is my own google image search library / tool
from google_images import install_webdriver, start_chrome, google_image_search

## reusable functions

In [None]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'

In [None]:
from itertools import chain, combinations
import inflect
from fastcore.foundation import L
import ipywidgets as widgets
import logging

In [None]:
def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))

In [None]:
p = inflect.engine()

In [None]:
def join_a_foo_and_a_bar(comb):
    return " and ".join(p.a(x) for x in comb)

In [None]:
join_a_foo_and_a_bar(["banana", "apple"])

'a banana and an apple'

In [None]:
def seq_diff(s1, s2):
    return L(filter(lambda x: x not in s2, s1))

In [None]:
def test_seq_diff():
    labels = L("bird", "cat", "dog")
    for comb in powerset(labels):
        print(comb, seq_diff(labels, comb))
        
test_seq_diff()

() ['bird', 'cat', 'dog']
('bird',) ['cat', 'dog']
('cat',) ['bird', 'dog']
('dog',) ['bird', 'cat']
('bird', 'cat') ['dog']
('bird', 'dog') ['cat']
('cat', 'dog') ['bird']
('bird', 'cat', 'dog') []


In [None]:
def confirm_delete(del_path):
    button = widgets.Button(description=f"Move data to trash: {del_path}?", layout=widgets.Layout(width='20em'))
    # button.on_click(lambda b: shutil.rmtree(del_path, ignore_errors=True))
    button.on_click(lambda b: send2trash(del_path))
    display(button)

In [None]:
def setup_logging(quiet=False, debug=False):
    if debug:
        logging.basicConfig(level=logging.DEBUG, format="%(asctime)s %(levelname)s %(message)s")
    elif quiet:
        logging.basicConfig(level=logging.ERROR, format="%(message)s")
    else:
        logging.basicConfig(level=logging.INFO, format="%(message)s")

## general setup

In [None]:
data = Path("bird_cat_dog")

In [None]:
null_query = "photo of outdoors"
labels = ["bird", "cat", "dog"]
query_prefix = "photo of "

In [None]:
samples_per_query = 200

## 1. find images using duckduckgo_search

In [None]:
searcher = "ddg"

In [None]:
confirm_delete(data/searcher)

Button(description='Move data to trash: bird_cat_dog/ddg?', layout=Layout(width='20em'), style=ButtonStyle())

In [None]:
for comb in powerset(labels):
    others = seq_diff(labels, comb)
    dirname = "_".join(comb) + "_"
    path = data/searcher/dirname
    query = query_prefix + join_a_foo_and_a_bar(comb) if comb else null_query
    query += " " + " ".join("-"+x for x in others)
    try:
        path.mkdir(parents=True, exist_ok=False)
    except FileExistsError as e:
        print(f"already downloaded: {query}")
        continue
    print(f"downloading: {query}")
    # license "any" means any Creative Commons license, which helps avoid stock photos;
    # few CC images show two or more of these animals, so drop the filter for combinations
    license = "any" if len(comb) < 2 else None
    results = ddg_images(query, max_results=samples_per_query, license_image=license)
    urls = [r["image"] for r in results]
    download_images(dest=path, urls=urls)

already downloaded: photo of outdoors -bird -cat -dog
already downloaded: photo of a bird -cat -dog
already downloaded: photo of a cat -bird -dog
already downloaded: photo of a dog -bird -cat
already downloaded: photo of a bird and a cat -dog
already downloaded: photo of a bird and a dog -cat
already downloaded: photo of a cat and a dog -bird
already downloaded: photo of a bird and a cat and a dog 


## 2. find images using laion

Deduplication means that fewer than `samples_per_query` images are returned: roughly 75% of the number requested.

In [None]:
searcher = "laion"

In [None]:
confirm_delete(data/searcher)

Button(description='Move data to trash: bird_cat_dog/laion?', layout=Layout(width='20em'), style=ButtonStyle()…

In [None]:
laion = ClipClient(
    url="https://knn.laion.ai/knn-service",
    indice_name="laion5B-H-14",
    aesthetic_score=0, aesthetic_weight=0,
    num_images=samples_per_query)

In [None]:
for comb in powerset(labels):
    dirname = "_".join(comb) + "_"
    path = data/searcher/dirname
    query = query_prefix + ", ".join(comb) if comb else null_query
    try:
        path.mkdir(parents=True, exist_ok=False)
    except FileExistsError as e:
        print(f"already downloaded: {query}")
        continue
    print(f"downloading: {query}")
    results = laion.query(text=query)
    urls = [r["url"] for r in results]
    download_images(dest=path, urls=urls)

downloading: photo of outdoors
downloading: photo of bird
downloading: photo of cat
downloading: photo of dog
downloading: photo of bird, cat
downloading: photo of bird, dog
downloading: photo of cat, dog
downloading: photo of bird, cat, dog


### references

- https://replicate.com/blog/grab-hundreds-of-images-with-clip-and-laion
- https://github.com/rom1504/clip-retrieval
- https://rom1504.github.io/clip-retrieval/

## 3. find images using Google search

In [None]:
searcher = "gimg"

In [None]:
confirm_delete(data/searcher)

Button(description='Move data to trash: bird_cat_dog/laion?', layout=Layout(width='20em'), style=ButtonStyle()…

In [None]:
with start_chrome() as wd:
    for comb in powerset(labels):
        others = seq_diff(labels, comb)
        dirname = "_".join(comb) + "_"
        path = data/searcher/dirname
        query = query_prefix + join_a_foo_and_a_bar(comb) if comb else null_query
        query += " " + " ".join("-"+x for x in [*others, "stock"])
        try:
            path.mkdir(parents=True, exist_ok=False)
        except FileExistsError as e:
            print(f"already downloaded: {query}")
            continue
        print(f"downloading: {query}")
        # tbs=il:cl asks for Creative Commons licensed images, which helps avoid stock photos;
        # few CC images show two or more of these animals, so drop the filter for combinations
        opts = "tbs=il:cl" if len(comb) < 2 else ''
        urls = google_image_search(query, safe=True, n=samples_per_query, opts=opts, wd=wd)
        download_images(dest=path, urls=urls)

INFO:WDM:Get LATEST chromedriver version for google-chrome 111.0.5563
INFO:WDM:Driver [/home/sam/.wdm/drivers/chromedriver/linux64/111.0.5563/chromedriver] found in cache


downloading: photo of outdoors -bird -cat -dog -stock
downloading: photo of a bird -cat -dog -stock
downloading: photo of a cat -bird -dog -stock
downloading: photo of a dog -bird -cat -stock
downloading: photo of a bird and a cat -dog -stock
downloading: photo of a bird and a dog -cat -stock
downloading: photo of a cat and a dog -bird -stock
downloading: photo of a bird and a cat and a dog -stock


## clean up the downloaded images

This was too much effort; I should have just deleted any images that failed to load. I'm not sure why the AVIF images wouldn't load.

In [None]:
from fastai.vision.all import *

In [None]:
!find bird_cat_dog/ -type f | xargs rename 's/\..*/.jpg/'

In [None]:
!find bird_cat_dog/ -size 0 | xargs rm -v

removed 'bird_cat_dog/gimg/dog_/3f824c0e-aa4b-4395-acdd-15a4243f0e41.jpg'
removed 'bird_cat_dog/laion/bird_/a9ff3712-5401-49e7-a9e9-1946a6ef46a9.jpg'
removed 'bird_cat_dog/laion/dog_/96b30f11-eb9a-4960-a41c-667bcd3d4b61.jpg'


In [None]:
failed = verify_images(get_image_files(data))

In [None]:
with open("failed.txt", "w") as f:
    for p in failed:
        print(p, file=f)

In [None]:
!< failed.txt xargs file | sed 's/.*: *//' | sed 's/ (.*//' | uniqoc

13	HTML document, ASCII text, with very long lines
6	JavaScript source, ASCII text, with very long lines
3	HTML document, ASCII text
1	HTML document, Unicode text, UTF-8 text, with CRLF, LF line terminators
7	HTML document, Unicode text, UTF-8 text, with very long lines
2	ISO Media, AVIF Image
1	gzip compressed data, from Unix, original size modulo 2^32 98001
2	HTML document, Unicode text, UTF-8 text


In [None]:
!< failed.txt xargs file | grep AVIF

bird_cat_dog/gimg/bird_cat_/7ad0c40a-afaa-4b63-938d-8d661a5c2f7f.jpg:     ISO Media, AVIF Image
bird_cat_dog/gimg/cat_dog_/cac07371-a384-4849-ad10-b7d143fcc4d2.jpg:      ISO Media, AVIF Image


In [None]:
!< failed.txt xargs file | grep AVIF | cut -d: -f1 | xa mogrify

In [None]:
len(failed)

35

In [None]:
failed = verify_images(get_image_files(data))

In [None]:
len(failed)

33

In [None]:
!< failed.txt xargs file | grep gzip | cut -d: -f1 | xa rename -v 's/$/.gz/'

bird_cat_dog/laion/bird_/d9401f3b-c20e-4607-aa03-5a6049d714f5.jpg renamed as bird_cat_dog/laion/bird_/d9401f3b-c20e-4607-aa03-5a6049d714f5.jpg.gz


In [None]:
!gunzip bird_cat_dog/laion/bird_/d9401f3b-c20e-4607-aa03-5a6049d714f5.jpg.gz

In [None]:
failed = verify_images(get_image_files(data))

In [None]:
len(failed)

33

In [None]:
failed.map(Path.unlink)

(#33) [None,None,None,None,None,None,None,None,None,None...]

In [None]:
!find bird_cat_dog/ -type f | xa file | perl -pe 's/^.*?: *//; s/,.*//' | uniqoc

3147	JPEG image data
294	PNG image data
467	RIFF (little-endian) data
15	GIF image data
50	ISO Media


Let's rename the images again to give them the right file extensions.

In [None]:
!find bird_cat_dog/ -type f | xa file | head -n 2

bird_cat_dog/ddg/bird_/4034a955-d434-47b1-874e-10dd77414160.jpg:         JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, progressive, precision 8, 1920x1271, components 3
bird_cat_dog/ddg/bird_/f3ae56d2-ebbe-4f09-b435-1d60f61f4e6a.jpg:         PNG image data, 500 x 401, 8-bit/color RGBA, interlaced


In [None]:
!find bird_cat_dog/ -type f | xa file | perl -ne '/(.*?)(\.\w+):\s*(\w+)/; print "$1 $2 ", lc $3, "\n"' | head -n2

bird_cat_dog/ddg/bird_/4034a955-d434-47b1-874e-10dd77414160 .jpg jpeg
bird_cat_dog/ddg/bird_/f3ae56d2-ebbe-4f09-b435-1d60f61f4e6a .jpg png
Unable to flush stdout: Broken pipe


In [None]:
images = get_image_files(data)
len(images)

3973

In [None]:
!find bird_cat_dog/ -type f | xa file | perl -ne '/(.*?)(\.\w+):\s*(\w+)/; rename("$1$2", "$1.".lc($3))'

In [None]:
images = get_image_files(data)
len(images)

3456

In [None]:
get_image_files??

Signature: get_image_files(path, recurse=True, folders=None)
Source:   
def get_image_files(path, recurse=True, folders=None):
    "Get image files in `path` recursively, only in `folders`, if specified."
    return get_files(path, extensions=image_extensions, recurse=recurse, folders=folders)
File:      /opt/venvs/python3.10-ai/venv/lib/python3.10/site-packages/fastai/data/transforms.py
Type:      function

In [None]:
{'.jpeg','.png','.riff','.gif','.iso'} - image_extensions

{'.iso', '.riff'}

In [None]:
!find bird_cat_dog/ -type f | xargs rename 's/\.riff$/.webp/; s/\.iso$/.avif/'

In [None]:
images = get_image_files(data)
len(images)

3973
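
In hindsight, the renaming could also be done in plain Python by looking at the first few bytes of each file. This is only a sketch: `sniff_image_ext` and `fix_extensions` are names I made up, and the magic-byte checks cover just the formats that `file` reported above (JPEG, PNG, GIF, WebP inside RIFF, AVIF inside ISO Media). Anything else is treated as unknown and can optionally be deleted, which is what I should have done with the failing downloads.

```python
from pathlib import Path

def sniff_image_ext(path):
    "Return the extension matching the file's magic bytes, or None if unrecognised."
    with open(path, "rb") as f:
        head = f.read(16)
    if head.startswith(b"\xff\xd8\xff"):
        return ".jpeg"
    if head.startswith(b"\x89PNG\r\n\x1a\n"):
        return ".png"
    if head[:6] in (b"GIF87a", b"GIF89a"):
        return ".gif"
    # WebP is a RIFF container with "WEBP" at byte offset 8
    if head[:4] == b"RIFF" and head[8:12] == b"WEBP":
        return ".webp"
    # AVIF is an ISO Media file whose ftyp brand is "avif" or "avis"
    if head[4:8] == b"ftyp" and head[8:12] in (b"avif", b"avis"):
        return ".avif"
    return None

def fix_extensions(root, delete_unknown=False):
    "Rename files under `root` to match their content; optionally delete the rest."
    renamed, unknown = [], []
    # sorted() materialises the listing before any file is renamed
    for p in sorted(Path(root).rglob("*")):
        if not p.is_file():
            continue
        ext = sniff_image_ext(p)
        if ext is None:
            unknown.append(p)
            if delete_unknown:
                p.unlink()
        elif p.suffix.lower() != ext:
            target = p.with_suffix(ext)
            if not target.exists():  # never overwrite an existing file
                p.rename(target)
                renamed.append(target)
    return renamed, unknown
```

Calling `fix_extensions(data)` would return the renamed paths and the unrecognised ones; passing `delete_unknown=True` removes the HTML and JavaScript files that came back instead of images.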

In [None]:
mv -v bird_cat_dog bird_cat_dog.1.orig

renamed 'bird_cat_dog' -> 'bird_cat_dog.orig'


## resize the images to no larger than 400x400

In [None]:
resize_images('bird_cat_dog.1.orig', dest='bird_cat_dog.2.resized', max_size=400, recurse=True)

In [None]:
!du -csh bird_cat_dog*

2.0G	bird_cat_dog.1.orig
101M	bird_cat_dog.2.resized
2.1G	total


## combine images from the different search engines together

In [None]:
cd ~/ai/blog/posts/multilabel

/home/sam/ai/blog/posts/multilabel


In [None]:
ls bird_cat_dog.2.resized/

[0m[01;34mddg[0m/  [01;34mgimg[0m/  [01;34mlaion[0m/


In [None]:
!cp -al bird_cat_dog.2.resized bird_cat_dog.3.together

In [None]:
cd bird_cat_dog.3.together/

/home/sam/ai/blog/posts/multilabel/bird_cat_dog.3.together


In [None]:
!(mkdir -p $(cd ddg; ls); rename 's{(\w+)/(\w+)/(.*)}{$2/${1}_$3}' */*/*)

In [None]:
!find . -mindepth 1 -depth -type d | xa rmdir --ignore-fail-on-non-empty
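
The shell one-liners above are terse, so here is a rough Python equivalent of the same move. It assumes the layout `<engine>/<label>/<file>` under the root and moves each file to `<label>/<engine>_<file>`, then removes the emptied directories. The function name `combine_engines` is my own, and it works in place, so it should be run on a copy such as the hard-linked one made above.

```python
from pathlib import Path

def combine_engines(root):
    "Move root/<engine>/<label>/<file> to root/<label>/<engine>_<file>."
    root = Path(root)
    moved = 0
    # sorted() materialises the listing before anything is moved
    for f in sorted(root.glob("*/*/*")):
        if not f.is_file():
            continue
        engine, label = f.parts[-3], f.parts[-2]
        dest_dir = root / label
        dest_dir.mkdir(exist_ok=True)
        f.rename(dest_dir / f"{engine}_{f.name}")
        moved += 1
    # remove directories that are now empty, children before parents
    for d in sorted((p for p in root.rglob("*") if p.is_dir()), reverse=True):
        if not any(d.iterdir()):
            d.rmdir()
    return moved
```

Prefixing the file name with the engine keeps track of where each image came from, and avoids name clashes between engines.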

In [None]:
cd ..

/home/sam/ai/blog/posts/multilabel


## check for exact duplicates

In [None]:
cp -al bird_cat_dog.3.together bird_cat_dog.4.dedup

In [None]:
cd bird_cat_dog.4.dedup/

/home/sam/ai/blog/posts/multilabel/bird_cat_dog.4.dedup


Automatically delete duplicates with the same labels:

In [None]:
!for d in *; do fdupes -f -r $d; done | grep . | xa rm

Manually delete duplicates with different labels:

In [None]:
!fdupes -r . | grep . | xa qiv -D
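
For anyone without `fdupes`, exact duplicates can also be found by hashing file contents. The sketch below is not what I ran; the function names are hypothetical. It groups files by SHA-256 digest, then separates groups that sit in a single label directory (safe to prune automatically) from groups that span several label directories (which need a manual look, because the labels disagree).

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_exact_duplicates(root):
    "Group files under `root` by SHA-256 digest; return groups of two or more."
    by_hash = defaultdict(list)
    for p in sorted(Path(root).rglob("*")):
        if p.is_file():
            by_hash[hashlib.sha256(p.read_bytes()).hexdigest()].append(p)
    return [group for group in by_hash.values() if len(group) > 1]

def split_by_label(groups):
    "Separate groups that share one label directory from groups that span several."
    same, mixed = [], []
    for group in groups:
        if len({p.parent for p in group}) == 1:
            same.append(group)
        else:
            mixed.append(group)
    return same, mixed
```

For the same-label groups, keeping `group[0]` and unlinking the rest would mirror `fdupes -f`, which omits the first file of each set from its output.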


## check for similar images

Ideally, I would show each set of duplicates together across a row, visually check them, eliminate any that aren't in fact duplicates, and keep the best-quality image from each set. I might not be able to determine automatically which image is the best quality after resizing them, so I might need to do this step before resizing.

In [None]:
!findimagedupes -R -f dups.db . 2>dups.err.txt > dups.txt

In [None]:
!< dups.txt xargs qiv -D

## create a CSV file describing the data

I'll put 20% of each directory of images into the validation set.

In [None]:
images = get_image_files(data)
len(images), images[:1]

(3973,
 (#1) [Path('bird_cat_dog/ddg/bird_/8f0a00b6-6140-40dd-81e9-c34bb0d37e15.jpeg')])
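
I haven't written this step yet, but here is a minimal sketch of how the CSV could be built. The column names `fname`, `labels` and `is_valid`, the space-delimited label format, and the function name `write_split_csv` are all my assumptions. Labels come from the directory name (for example `bird_cat_`, with `_` alone meaning no labels), and 20% of each directory is held out for validation using a fixed seed.

```python
import csv
import random
from pathlib import Path

def write_split_csv(root, csv_path, valid_pct=0.2, seed=42):
    "Write fname, labels, is_valid rows, holding out `valid_pct` of each directory."
    root = Path(root)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rows = []
    for d in sorted(p for p in root.rglob("*") if p.is_dir()):
        files = sorted(f for f in d.iterdir() if f.is_file())
        if not files:
            continue
        rng.shuffle(files)
        n_valid = round(len(files) * valid_pct)
        # directory names look like "bird_cat_"; "_" alone means no labels
        labels = " ".join(part for part in d.name.split("_") if part)
        for i, f in enumerate(files):
            rows.append((f.relative_to(root).as_posix(), labels, i < n_valid))
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["fname", "labels", "is_valid"])
        writer.writerows(rows)
    return rows
```

Splitting per directory, rather than over the whole set, keeps every label combination represented in both the training and validation sets.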

## train the model

In [None]:
from fastai.vision.all import *

## scratch

In [None]:
!mvdata bird_cat_dog

mkdir: created directory '/home/sam/ai/data/blog'
mkdir: created directory '/home/sam/ai/data/blog/posts'
mkdir: created directory '/home/sam/ai/data/blog/posts/multilabel'
mv   renamed 'bird_cat_dog' -> '/home/sam/ai/data/blog/posts/multilabel/bird_cat_dog'
ln   'bird_cat_dog' -> '/home/sam/ai/data/blog/posts/multilabel/bird_cat_dog'


In [None]:
data.readlink()

Path('/home/sam/ai/data/blog/posts/multilabel/bird_cat_dog')

In [None]:
!mv ~/ai/data/blog/posts/multilabel/{bird_cat_dog,bird_cat_dog.orig}