# Let's get started

## Pickling Cats

In this tutorial, we will pickle some cats. We only need to pickle the cats' keys, because the source cat pics stay where they first sat down for a nap, that is to say, in ***cats/source***. All of this tutorial's work happens in ***cats/thumbs***. I've hardwired "thumbs" but kept "cats" flexible. If you need project namespaces, I suggest changing cats to dogs or something more suitable.

### Lab_Black

First, good habits. Running `%load_ext lab_black` automatically formats every cell you run with Black, the uncompromising Python code formatter. It's pretty and it works, so I use it. Most comments from here forward are Python comments in the code blocks.

In [None]:
%load_ext lab_black

In [None]:
# I load a few external libraries here.
# Examples run on a Linux JupyterLab per https://mikelev.in/ux
# We don't need much at first. I'll do more of these later.
# Package import statements like these load resources that are
# not loaded by default. Some have already been pip installed.
# Go ahead and run this cell.

from time import time  # To measure how many seconds things took
from pathlib import Path  # To read & write to the local drive
import matplotlib.pyplot as plt  # To display data (graphs)
from pickle import loads, dumps  # Load/save Python data to drive.
from sqlitedict import SqliteDict as sqldict  # Treat dicts as dbs.

print("Very good. Now you can run the next cell.")
print("Done")

In [None]:
# The global data variable becomes your file read/write namespace.
# Choose something you want as a directory, folder, git repo name.
# You can work in any folder and this will still organize your data.
# Watch for a folder appearing next to this Notebook called "cats"...
# (If cats already exists, it skips this step)

data = "cats"
Path(data).mkdir(exist_ok=True)
print("Done")

In [None]:
# Let's count from 0 to 9 in Python.
# Notice how we change the "end of line" behavior of print.
for i in range(10):
    print(i, end=" ")
print("Done")

In [None]:
# Let's get visual right away. It's so easy in Jupyter, why not?
# Let's plot from 0 to 9 in matplotlib. Both X and Y contain 0 to 9.

plt.title("When X = Y")
plt.xlabel("range(10)")
plt.ylabel("range(10)")
plt.plot(range(10), range(10))
plt.show()
print("Done")

In [None]:
# Let's count from 1 to 10 in Python.
# It's worth knowing that you CAN start counting from 1.

for i in range(1, 11):
    print(i, end=" ")
print("Done")

In [None]:
# Though you can make it 1 when you need it, doesn't it just look
# so much nicer like this? Don't make it harder than you need to.

for i in range(10):
    print(i + 1, end=" ")
print("Done")

In [None]:
# There are many cases where you have to do something every N rows,
# so we look at that first. Notice the modulo operator. On every 1,000th
# row the calculation returns 0, so the if-line evaluates true. It's a good
# way to make count-down timers that don't over-print to your Notebook.

for i in range(100000):
    if not i % 1000:
        print(i, end=" ")
print("Done")

In [None]:
# You are now in Wonderland. Everything in math you thought too abstract
# to do you any good in real-life now does you good. Let's use powers of 10
# to restate the above. See? Isn't it nice to just know how many 0's?

for i in range(10**5):  # Ten with 5 0's is 100,000.
    if not i % 10**3:  # Ten with 3 0's is 1,000.
        print(i, end=" ")
print("Done")

In [None]:
# Numbers are most easily read when formatted well for reading.
# We will be making heavy use of "f-strings". Notice the colon-comma:

for i in range(10**5):
    if not i % 10**3:
        print(f"{i:,}", end=" ")
print("Done")
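
The colon inside an f-string introduces a format specification. As a small sketch of the specs this notebook leans on (the sample values are made up for illustration):

```python
# A few format specifications used throughout this notebook.
count = 1234567       # made-up integer
size = 1234567.891    # made-up float

with_commas = f"{count:,}"    # thousands separators
rounded = f"{size:,.0f}"      # thousands separators, zero decimals
two_places = f"{size:.2f}"    # two decimals, no separators
padded = str(7).zfill(3)      # zero-padding, as used for the cat filenames later

print(with_commas, rounded, two_places, padded)
```

The comma and the precision can be combined in one spec, which is how the file-size cells further down print whole kilobytes.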

In [None]:
# Since we're working on formatting, you ought to know you have a
# rich environment for outputting structures like tables.
# While you're at it, get used to seeing tuples get splat.
# The table-formatter is Rich. I don't use it much, but
# you do almost every time you pip install.

from rich.table import Table

table = Table(show_header=True, header_style="bold red")
table.add_column("thou_step", style="green", width=12)
table.add_column("foo", style="cyan", width=12)

for i in range(10**5):
    if not i % 10**3:
        atuple = (f"{i:,}", "bar")
        table.add_row(*atuple)
table

In [None]:
# A more popular way to view your data is as "Pandas DataFrames".
# These are in-memory Excel tabs or SQL tables, if you will.
# They're not really databases, but are often used as such.
# We will make one here from this loop. We stop f-string comma formatting
# because as a pd df, we can do it Pythonically pedantically idiomatically.
# I use this approach all the time.

import pandas as pd

table = []
for i in range(10**5):
    if not i % 10**3:
        table.append(i)
df = pd.DataFrame(table, columns=["thousands"])
df

In [None]:
# Take a look at the data-types of that DataFrame.
df.dtypes

In [None]:
# To format a number with commas, we make it a float.
# Pandas now displays floats with commas everywhere in this Notebook,
# but when you save or convert the data, the commas are not stored.
# Pandas DataFrames are great for Excel or SQL-like things in Python
# except without having to have a database. You load and save files.

pd.options.display.float_format = "{:,}".format

df = df.astype(float)
df

In [None]:
# This df can be saved to your drive as data in various ways.
# Here we will use one of the best alternatives to CSV for large files.

df.to_parquet(f"{data}/notbutter.parquet")
print("Done")

In [None]:
# And you can load it back off of the drive.
# Yes, I'm recycling the same variable names, but trust me.

df = pd.read_parquet(f"{data}/notbutter.parquet")
df

In [None]:
# What's it like to write a million lines into a text file?
filename = f"{data}/text.txt"
with open(filename, "wt") as fh:
    for i in range(1000000):
        fh.write(f"{i}\n")
print("Done")  # Fast!!!

In [None]:
# Okay, how about a million lines into a Pandas DataFrame then parquet it?
table = []
for i in range(10**6):
    table.append(i)
df = pd.DataFrame(table, columns=["thousands"])
df.to_parquet(f"{data}/morenotbutter.parquet")
print("Done")

In [None]:
# How big is a million-line Pandas DataFrame stored in parquet format?
filename = f"{data}/morenotbutter.parquet"
bytesize = Path(filename).stat().st_size
kilo = 1000
print(f'The file "{filename}" is {bytesize:,} Bytes.')
print(f"Abbreviated to {bytesize / kilo:,.0f} Kilobytes.")  # :,.0f means commas, zero decimals
print(f"Or just {bytesize / kilo / kilo:.0f} Megs.")
print("Done")
# A million short lines is still several megs.

In [None]:
# Showing those file sizes seems useful.
# Can we make that a reusable function?


def get_size(afile):
    filename = f"{afile}"
    bytesize = Path(filename).stat().st_size
    kilo = 1000
    print(f'The file "{filename}" is {bytesize:,} Bytes.')
    print(f"Abbreviated to {bytesize / kilo:,.0f} Kilobytes.")  # :,.0f means commas, zero decimals
    print(f"Or just {bytesize / kilo / kilo:.0f} Megs.")


get_size(f"{data}/morenotbutter.parquet")

In [None]:
# What about with a real database? Let's use Python's SQLite.
# What's it like to write 100,000 keys into a SQLite database?
# Let's count down instead of up. Keys go up but count is down.
# As you can see, "real" database operations like uniqueness
# enforced with primary keys on insert statements have a time cost.
# Sometimes it's worth it, as with crawler data collection, but
# in most cases, we pass on a real database in favor of parquet.

filename = f"{data}/database.db"
now = time()
upto = 100000
with sqldict(filename) as db:
    for i in range(upto):
        db[i] = None
        if not i % 10000:
            db.commit()
            print(f"{upto - i:,}...", end=" ")
seconds = int(time() - now)
get_size(filename)
print(f"\nDone ({seconds} seconds)")
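
To make "uniqueness enforced with primary keys" concrete, here is a rough sketch using the standard library's `sqlite3` directly rather than `sqlitedict`. The table name `seen` and its columns are assumptions made up for this illustration, and the database lives in memory so nothing is written to the drive.

```python
import sqlite3

# In-memory database; the table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seen (key TEXT PRIMARY KEY, value BLOB)")

# Batch the inserts in one transaction, similar to committing every N rows.
rows = [(str(i), None) for i in range(1000)]
with conn:
    conn.executemany("INSERT INTO seen VALUES (?, ?)", rows)

# Inserting an existing key again is rejected by the primary key.
try:
    with conn:
        conn.execute("INSERT INTO seen VALUES (?, ?)", ("0", None))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM seen").fetchone()[0]
print(row_count, duplicate_rejected)
conn.close()
```

That uniqueness check on every insert is part of the time cost mentioned in the cell above.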

In [None]:
# Let's count down from a billion by hundred-millions.
# The point of running this is just to appreciate that big numbers are big.
# It can take a minute to count down from a billion on your laptop.

hundredmillion = 10**8
billion = 10**9
now = time()
modulorow = 1
print("Count down with me from a billion in Python:")
for i in range(billion):
    if not i % hundredmillion:
        glimpse = int(time() - now)
        print(f"{modulorow}: {billion - i:,} ({glimpse} sec)...")
        modulorow += 1
seconds = int(time() - now)
print(f"Done ({seconds} seconds)")  # Computers are fast but not that fast

In [None]:
# That's a billion. Don't let it scare you. Hundreds of millions can be OK
# if you use the right techniques to load, process and save data.
# In this example we build a set of ten million keys, pickle it to a file
# and load it back.

filename = f"{data}/pickledump.pkl"
seen = set()
ten_million = 10**6 * 10
print(f"{ten_million:,}")

for i in range(ten_million):
    seen.add(i)
print(f"Made {len(seen):,} keys.")

# Dump pickled set to file
with open(filename, "wb") as fh:
    fh.write(dumps(seen))
print(f'Saved "{filename}" to drive.')

get_size(filename)

# Load pickled set out of file
with open(filename, "rb") as fh:
    seen = loads(fh.read())
print(f"Read native Python {type(seen)} back off of drive.")
print(f"{ten_million:,} is no biggie using these techniques.")
print("Done")
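
The cell above writes the bytes returned by `dumps`. A slightly different sketch of the same round trip uses `pickle.dump` and `pickle.load`, which work on the open file handle directly. The temporary directory and the file name `seen.pkl` are assumptions for the example, and the set is kept smaller so it runs quickly.

```python
import pickle
import tempfile
from pathlib import Path

seen_keys = set(range(10**5))  # smaller than the ten million above

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "seen.pkl"  # hypothetical file name

    # pickle.dump streams straight into the open file handle.
    with open(pkl_path, "wb") as fh:
        pickle.dump(seen_keys, fh, protocol=pickle.HIGHEST_PROTOCOL)

    pkl_bytes = pkl_path.stat().st_size

    # pickle.load reads it back as a native Python set.
    with open(pkl_path, "rb") as fh:
        restored = pickle.load(fh)

print(f"{pkl_bytes:,} bytes, round trip ok: {restored == seen_keys}")
```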

In [None]:
# Now we're getting down to business. Cats!
# We load a lot of packages here so I have an easy entry-point later.

import shutil
import pandas as pd
from PIL import Image
from httpx import get
from io import BytesIO
from time import sleep
from pathlib import Path
from collections import Counter
from pickle import loads, dumps
from imagehash import phash, whash
from IPython.display import display
from PIL.PngImagePlugin import PngInfo

# Where we save cats and generate thumbs.
# This also sets the stage for auto-classifiers.
data = "cats"
source = f"{data}/source"
thumbs = f"{data}/thumbs"
tagtable = Path(f"{data}/tagtable.pkl")
by_types = ["by_folder", "by_ham", "by_hist", "by_size"]

# Make those locations if they don't exist.
Path(data).mkdir(exist_ok=True)
Path(source).mkdir(exist_ok=True)
Path(thumbs).mkdir(exist_ok=True)
print("Done")

In [None]:
# Download 30 cats that don't exist.

# If you actually want to fetch 30 new cats that don't exist,
# delete the contents of the cats/source folder and re-run.
# You can delete just a single cat from source and watch it re-fill,
# though doing so removes data that thumbnails already reference.
# Fetch more. Whatever.

url = "https://thiscatdoesnotexist.com/"
cats = 30
for i in range(cats):
    filename = f"{source}/cat-{str(i).zfill(3)}.jpg"
    if not Path(filename).exists():
        print(f"{cats - i} Downloading: {filename}")
        response = get(url)
        img = Image.open(BytesIO(response.content))
        img.save(filename)
        sleep(1)
print("Done")

In [None]:
# Generate thumbnails for the source folder of cat images.
# If you wish to see the thumbnails generate again, you have to
# delete seencats.pkl and the contents of thumbs folder.

size = 64

# Load set of seen cats from pickle if exists.
pickled_cats = f"{data}/seencats.pkl"
if Path(pickled_cats).exists():
    with open(pickled_cats, "rb") as fh:
        seen = loads(fh.read())
else:
    seen = set()

# Make thumbnails of cat pics.
for cat in Path(source).glob("*.jpg"):
    img = Image.open(cat)
    thumb = img.copy()
    thumb.thumbnail((size, size))
    awhash = whash(img, hash_size=8)
    width, height = img.width, img.height
    bands = "".join(img.getbands())
    meta_data = {
        "filename": cat.name,
        "width": width,
        "height": height,
        "format": img.format,
        "format_description": img.format_description,
        "bands": img.getbands(),
        "extremes": img.getextrema(),
        "xmp": img.getxmp(),
    }
    pi = PngInfo()
    for meta in meta_data:
        pi.add_text(meta, f"{meta_data[meta]}")
    filename = f"{width}x{height}_{awhash}_.png"
    if filename not in seen:
        print(cat)
        display(thumb)
        seen.add(filename)
        print(filename)
        thumb.save(
            f"{thumbs}/{filename}",
            "PNG",
            pnginfo=pi,
            save_all=True,
        )
        print()
with open(pickled_cats, "wb") as fh:
    fh.write(dumps(seen))

# Report size of file
bytesize = Path(pickled_cats).stat().st_size
print(f"{pickled_cats} is {bytesize:,} Bytes")

print("Done")

In [None]:
def size_name(n):
    sizes = {
        3: "Thousand",
        6: "Million",
        9: "Billion",
        12: "Trillion",
        15: "Quadrillion",
        18: "Quintillion",
        21: "Sextillion",
        24: "Septillion",
        27: "Octillion",
        30: "Nonillion",
        33: "Decillion",
        36: "Undecillion",
        39: "Duodecillion",
        42: "Tredecillion",
        45: "Quattuordecillion",
        48: "Quindecillion",
        51: "Sexdecillion",
        54: "Septendecillion",
        57: "Octodecillion",
        60: "Novemdecillion",
        63: "Vigintillion",
    }
    exponent = len(str(n)) - 1
    exponent -= exponent % 3
    size = sizes.get(exponent, "extremely large")
    return size


# Notice how some cats are more hexed than others.
print("How unique can a 16-digit hexadecimal number really be?")
print()
print("Filename_extract converted_2hex decimal big_number_name...")
for cat in seen:
    parts = cat.split("_")
    cat_hash = parts[1]  # Not "whash", which would shadow imagehash's whash().
    ahex = hex(int(cat_hash, 16))
    adec = int(ahex, 16)
    word = size_name(adec)

    print(cat_hash, ahex, f"{adec:,}", word)
print("Done")
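
As a back-of-the-envelope answer to that question, each hex digit carries 4 bits, so 16 digits describe a 64-bit number. The sample hash below is made up, not taken from a real cat.

```python
hex_digits = 16
bits = hex_digits * 4        # each hex digit encodes 4 bits
possible_hashes = 2**bits    # same as 16**hex_digits

sample = "ffd8a0c0e0f0f8fc"  # made-up hash for illustration
as_int = int(sample, 16)

print(f"{bits} bits -> {possible_hashes:,} possible values")
print(f"{sample} -> {as_int:,}")
```

Keep in mind that a perceptual hash is designed so that similar images land on nearby values, so the hashes are not spread evenly over that range.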

In [None]:
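# Walk a folder tree recursively, map each filename to the folder it
# sits in (cdict) and collect (filename, tag) tuples for the chosen sort.

from os import scandir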


def build_cdict(path):
    global sort_choice, by_types, cdict, seen, seensizes, tags
    for entry in scandir(path):
        if entry.is_dir(follow_symlinks=False):
            try:
                build_cdict(entry.path)
            except:
                continue
        else:
            try:
                found = entry.stat(follow_symlinks=False)
            except:
                continue
            name, path = entry.name, entry.path
            seen.add(name)
            path = path.split("/")
            parts = name.split("_")
            size = parts[0]
            cdict[name] = "/".join(path[:-1])
            if sort_choice == "by_folder":
                classifications = path[2:-1]
                classifications = [
                    x
                    for x in classifications
                    if not x.isnumeric() and x not in by_types
                ]
            elif sort_choice == "by_size":
                classifications = [size]
            else:
                classifications = []
            if classifications:
                tags.update((name, tag) for tag in classifications)
    return cdict


print("Done")

In [None]:
# Calculate minimum hamming distance, seen sizes and histo distances.

import matplotlib.pyplot as plt  # To display data (graphs)

# Update cdict with latest file locations
cdict = {}
tags = set()
seen = set()
seensizes = set()
ham_goes = {}
hist_goes = {}

# First we auto-classify by width x height formats.
sort_choice = ""
cdict = build_cdict(thumbs)
hamdiffs = Counter()
cat_pairs = set()
hdict = {}
for cat1 in cdict:
    parts1 = cat1.split("_")
    file_path = f"{cdict[cat1]}/{cat1}"
    img = Image.open(file_path)
    hdict[cat1] = img.histogram()
    washcat1 = parts1[1]
    for cat2 in cdict:
        parts2 = cat2.split("_")
        washcat2 = parts2[1]
        int1, int2 = [int(x, 16) for x in (washcat1, washcat2)]
        if int1 != int2:
            diff = bin(int1 ^ int2).count("1")
            append_list = [int(diff)]
            catpairdiff = tuple(sorted([washcat1, washcat2]) + append_list)
            hamdiffs[diff] += 1
            cat_pairs.add(catpairdiff)
sorted_dict = dict(sorted(hamdiffs.items(), key=lambda item: item[0], reverse=False))
plt.bar(hamdiffs.keys(), hamdiffs.values())
plt.xticks(rotation=90)
plt.title("Hamming Distance Groups")
plt.show()

df = pd.DataFrame(cat_pairs, columns=["cat1", "cat2", "ham"])
min_hams = set()
for cat in df.groupby("cat1"):
    name, dfg = cat
    min_ham = dfg.ham.min()
    min_ham = str(min_ham).zfill(2)
    min_hams.add(min_ham)
    ham_goes[name] = min_ham
for cat_hash in {x.split("_")[1] for x in cdict} - ham_goes.keys():
    ham_goes[cat_hash] = "00"
print("minimum hams:", min_hams)
print("total minimums:", len(min_hams))
print(len(ham_goes))

sort_choice = "by_ham"
cdict = build_cdict(f"{data}/thumbs")
print("Done")
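
The Hamming distance in the cell above is the number of bit positions where two hashes differ: XOR the two integers, then count the 1 bits. A minimal standalone sketch, where the helper name `hamming` and the three sample hashes are made up for illustration:

```python
from itertools import combinations


def hamming(hex1, hex2):
    """Count the bits that differ between two hex-encoded hashes."""
    return bin(int(hex1, 16) ^ int(hex2, 16)).count("1")


a = "00000000000000ff"
b = "00000000000000f0"
c = "ffffffffffffffff"

# combinations() visits each unordered pair once, so no pair is counted twice.
pairs = {(x, y): hamming(x, y) for x, y in combinations(sorted([a, b, c]), 2)}
for (x, y), dist in pairs.items():
    print(x, y, dist)
```

A small distance means two thumbnails look alike to the hash; a distance of 0 means the hashes are identical.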

In [None]:
# Create the hist_goes grouping dict.

from sklearn.cluster import KMeans


sort_choice = ""
cdict = build_cdict(thumbs)
intersections = []
image_paths = []


for file in cdict:
    path = cdict[file]
    filepath = f"{path}/{file}"
    img = Image.open(filepath)
    hist = img.histogram()
    intersections.append(hist)
    image_paths.append(file)

table = []
for r in range(int(len(intersections) / 3), 2, -1):
    kmeans = KMeans(n_clusters=r, n_init="auto").fit(intersections)
    cluster_assignments = kmeans.labels_
    clusters = {}
    for i, assignment in enumerate(cluster_assignments):
        if assignment not in clusters:
            clusters[assignment] = []
        clusters[assignment].append(image_paths[i])
    counts = [(len(clusters[x])) for x in clusters]
    print(counts)
    if len([x for x in counts if x == 1]) <= 2:
        n = len(counts)
        break

# "clusters" already holds the final grouping from the loop above.
# Map each file to its zero-padded cluster label.
hist_goes = {}
for label in clusters:
    cluster = clusters[label]
    for file in cluster:
        hist_goes[file] = str(label).zfill(3)

plt.bar(clusters.keys(), counts)
plt.xticks(rotation=90)
plt.title("Histogram Clusters")
plt.ylabel("Matches in Group")
plt.xlabel("Histogram Intersection Groups")
plt.show()
print("Done")

In [None]:
# Move each thumbnail into the folder for the chosen grouping.
def sort_it(by=""):
    for file in cdict:
        from_folder = cdict[file]
        parts = file.split("_")
        size, whash = parts[:2]

        by_dict = {
            "by_hist": hist_goes[file],
            "by_ham": ham_goes[whash],
            "by_size": size,
            "": "",
        }
        if sort_choice == "by_folder":
            to_folder = Path(from_folder)  # Tagging leaves files where they are.
        else:
            try:
                to_folder = Path(f"{data}/thumbs/{by}/{by_dict[by]}")
            except:
                continue
        if not to_folder.is_dir():
            Path(to_folder).mkdir(parents=True, exist_ok=True)
        full_path = f"{from_folder}/{file}"
        dest_file = Path(f"{to_folder}/{file}")
        if not dest_file.is_file():
            dest = shutil.move(full_path, to_folder)
            cdict[file] = to_folder


print("Done")

In [None]:
# Sort thumbnails by similar color usage.
sort_choice = "by_hist"
cdict = build_cdict(thumbs)
sort_it(sort_choice)

In [None]:
cdict

In [None]:
# Sort thumbnails by size.
sort_choice = "by_size"
cdict = build_cdict(thumbs)
sort_it(sort_choice)

In [None]:
# Sort thumbnails by minimum hamming distances.
sort_choice = "by_ham"
cdict = build_cdict(thumbs)
sort_it(sort_choice)

In [None]:
# Move all thumbnails back to top-level.
sort_choice = ""
cdict = build_cdict(thumbs)
sort_it()

In [None]:
# We can drag things into folders to tag them.

sort_choice = "by_folder"
cdict = build_cdict(thumbs)
sort_it(sort_choice)
# Merge this run's tags into any tags already saved to the drive.
if tagtable.exists():
    with open(tagtable, "rb") as fh:
        existing_tags = loads(fh.read())
else:
    existing_tags = set()
existing_tags.update(tags)
with open(tagtable, "wb") as fh:
    fh.write(dumps(existing_tags))
print("Done")

# We can use DataFrames to do our Matplotlib plots.
df = pd.DataFrame(sorted(existing_tags), columns=["file", "tag"])
df.groupby("tag").count().plot(kind="bar")

In [None]:
# Yes, the metadata is stored in the PNG thumbnails.
for i, cat in enumerate(cdict):
    file = f"{cdict[cat]}/{cat}"
    print(file)
    img = Image.open(file)
    meta = img.text
    for key in meta:
        print(f"{key}: {meta[key]}")
    print()
    if i >= 2:
        break  # Seen enough proof?
print("Done")