<a href="https://colab.research.google.com/github/telpirion/FantasyMaps/blob/main/notebooks/final/1_firestore.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Storing training data in Firestore


<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/notebook_template.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/notebook_template.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/notebook_template.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>                                                                                               
</table>

## Overview

This notebook demonstrates how to collect data from social media (e.g. Reddit), preprocess that data, and then store that data in a Firestore collection. In this scenario, you collect fictional, hand-drawn maps that are used for virtual role-playing games. (Later, you convert these maps into training data for a Vertex AI notebook.)


### Objective

In this tutorial, you learn how to create a dataset in Cloud Storage and Firestore from a third-party API.

This tutorial uses the following Google Cloud resources:

+ Cloud Storage bucket
+ Firestore collection

The steps performed include:

1. Collecting images from Reddit
1. Storing the images in Cloud Storage
1. Inferring training data from the images
1. Storing that training data in a Firestore collection

### Dataset

In this tutorial, you collect data from subreddit posts on Reddit. The individual posts are processed and their metadata is stored in Firestore. Any image data the posts contain is extracted and stored in a Cloud Storage bucket.

### Costs

This tutorial uses billable components of Google Cloud:

* Firestore
* Cloud Storage

Learn about [Firestore pricing](https://cloud.google.com/firestore/pricing),
and [Cloud Storage pricing](https://cloud.google.com/storage/pricing),
and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Installation

Install the following packages required to execute this notebook.

In [1]:
%%writefile requirements.txt
google-cloud-firestore
google-cloud-secret-manager
google-cloud-storage
praw
pandas
numpy
spacy
pillow

Overwriting requirements.txt


In [2]:
! pip install --user --upgrade -qr requirements.txt

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bigframes 0.22.0 requires pandas<2.1.4,>=1.5.0, but you have pandas 2.2.3 which is incompatible.
ibis-framework 7.1.0 requires numpy<2,>=1, but you have numpy 2.2.1 which is incompatible.
matplotlib 3.7.3 requires numpy<2,>=1.20, but you have numpy 2.2.1 which is incompatible.
numba 0.58.1 requires numpy<1.27,>=1.22, but you have numpy 2.2.1 which is incompatible.
scipy 1.11.4 requires numpy<1.28.0,>=1.21.6, but you have numpy 2.2.1 which is incompatible.
ydata-profiling 4.6.0 requires numpy<1.26,>=1.16.0, but you have numpy 2.2.1 which is incompatible.
ydata-profiling 4.6.0 requires pandas!=1.4.0,<2.1,>1.1, but you have pandas 2.2.3 which is incompatible.

We will also use a simple natural language parsing library to analyze posts. For this use case, we'll use the open source library [spaCy](https://spacy.io). spaCy requires that a language model be downloaded before it can be used.

In [3]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


We have prepared a library of functions to help you process images.

In [4]:
! pip uninstall -y fantasy_maps_lib
! pip install --user -q "git+https://github.com/telpirion/fantasy-maps-lib.git#egg=fantasy_maps_lib"

Found existing installation: fantasy_maps_lib 0.0.0
Uninstalling fantasy_maps_lib-0.0.0:
  Successfully uninstalled fantasy_maps_lib-0.0.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bigframes 0.22.0 requires pandas<2.1.4,>=1.5.0, but you have pandas 2.2.3 which is incompatible.
ydata-profiling 4.6.0 requires numpy<1.26,>=1.16.0, but you have numpy 1.26.4 which is incompatible.
ydata-profiling 4.6.0 requires pandas!=1.4.0,<2.1,>1.1, but you have pandas 2.2.3 which is incompatible.

## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. [Enable the Firestore API](https://console.cloud.google.com/flows/enableapi?apiid=firestore.googleapis.com).

4. [Enable the Secret Manager API](https://console.cloud.google.com/flows/enableapi?apiid=secretmanager.googleapis.com).

5. [Enable the Cloud Storage API](https://console.cloud.google.com/flows/enableapi?apiid=storage.googleapis.com).

6. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

#### Set your project ID

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)

In [None]:
# If using Colab secrets, import them here
#from google.colab import userdata
#PROJECT_ID = userdata.get('PROJECT_ID')

#! gcloud config set project {PROJECT_ID}

In [5]:
# If running in Vertex Workbench or Colab Enterprise, set your project here
PROJECT_ID = !gcloud config get-value project
PROJECT_ID = PROJECT_ID[0]
print(PROJECT_ID)

fantasymaps-334622


In [None]:
# If running in some other Jupyter notebook installation, provide your
# Google Cloud project name
#PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

# Set the project id
#! gcloud config set project {PROJECT_ID}

#### Region

You can also change the `REGION` variable used by Vertex AI. Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [6]:
REGION = "us-west1"  # @param {type: "string"}

### Authenticate your Google Cloud account

Depending on your Jupyter environment, you may have to manually authenticate. Follow the relevant instructions below.

**1. Vertex AI Workbench**
* Do nothing as you are already authenticated.

**2. Local JupyterLab instance, uncomment and run:**

In [None]:
# ! gcloud auth login

**3. Colab, uncomment and run:**

In [None]:
#from google.colab import auth
#auth.authenticate_user()

**4. Service account or other**
* See how to grant Cloud Storage permissions to your service account at https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples.

### Create a Cloud Storage bucket

Create a storage bucket to store intermediate artifacts such as datasets.


In [None]:
BUCKET_URI = f"gs://fantasy-maps-{PROJECT_ID}-unique"  # @param {type:"string"}

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
#! gsutil mb -l $REGION -p $PROJECT_ID $BUCKET_URI

### Import libraries

In [7]:
from google.cloud import firestore
from google.cloud import secretmanager
from google.cloud import storage

from PIL import Image

import hashlib
import json
import math
import numpy as np
import os
import pandas as pd
import pprint
import praw
import re
import requests
import shutil
import spacy

In [8]:
from fantasy_maps.reddit import posts
from fantasy_maps.image import extract, shards
from fantasy_maps.image.image_metadata import ImageMetadata
from fantasy_maps.gcp.storage import store_image_gcs
from fantasy_maps.gcp.firestore import store_metadata_fs

### Get a Reddit API key

You need a [Reddit API key](https://www.reddit.com/wiki/api/) to access Reddit programmatically. Copy your API key into the labeled fields of the dictionary in the following cell.

**Note**: Once you have an API key, you must store it somewhere safe. It is recommended to store your API key as a JSON-formatted string in Cloud Secret Manager.

In [9]:
SECRET_RESOURCE_NAME = f"projects/{PROJECT_ID}/secrets/reddit-api-key/versions/1" # @param {type:"string"}

In [10]:
secret_client = secretmanager.SecretManagerServiceClient()
secret = secret_client.access_secret_version(name=SECRET_RESOURCE_NAME)
reddit_key_json = json.loads(secret.payload.data)
# Print only the key names; never print the secret values themselves.
pprint.pprint(sorted(reddit_key_json.keys()))

['client_id', 'secret', 'user_agent', 'user_name']


## Query data (posts) on Reddit

Now that we have our API key ready for use, we can query Reddit for our data! In the next cell, we will read the top 100 "hot" posts from a subreddit.

For our use-case, we want to check the posts to see whether: 1) they have an image associated with them; and 2) the title gives us some clues as to the contents (e.g. columns and rows) contained in the image.

In [11]:
columns = ['Title', 'Post', 'ID', 'URL']

subreddit_name = "battlemaps"
reddit_posts = posts.get_reddit_posts(reddit_credentials=reddit_key_json,
                                subreddit_name=subreddit_name, limit=100)
reddit_dicts = posts.convert_posts_to_dicts(reddit_posts, columns)

Now that we have the top 100 "hot" posts from the subreddit, we're going to filter for only the posts that we want. Again, our criteria are: 1) the post must link to a JPEG image; 2) the title must state the grid dimensions of the image (for example, "30x40").

We'll keep each matching post as an `ImageMetadata` object and print how many posts passed the filter.
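The title check can be tried in isolation. The following is a minimal sketch of the two filter criteria; the helper names `parse_grid_dims` and `is_candidate` are hypothetical and are not part of `fantasy_maps_lib`.

```python
import re

# Matches grid dimensions such as "30x40" anywhere in a title.
GRID_DIMS = re.compile(r'(\d+)x(\d+)')


def parse_grid_dims(title):
    """Return (columns, rows) from the first NNxNN in the title, or None."""
    match = GRID_DIMS.search(title)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_candidate(url, title):
    """Apply both checks: a JPEG URL and grid dimensions in the title."""
    return 'jpeg' in url and parse_grid_dims(title) is not None


print(parse_grid_dims('Cliffside Beach 30x40 battle map '))
```

Returning the parsed numbers rather than a bare match makes it easy to reuse the same helper later, when the columns and rows are needed for the cell computations.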

In [12]:
reddit_images = [
  ImageMetadata(
      url=d['URL'],
      title=d['Title'],
      rid=d['ID']) for d in reddit_dicts if (
    'jpeg' in d['URL'] and
    re.search(r'\d+x\d+', d['Title'])
  )
]
print(len(reddit_images))

34


## Process the images and their metadata

In this next step, we will process all of the Reddit posts that have images:

1. Download the image itself
2. Parse each image's metadata
3. (If needed) Split the image into smaller images
4. Store the metadata in Firestore
5. Save the images on Cloud Storage

### Download the image

We're going to download the image locally. We'll need a meaningful filename to save the image. We also need a directory system to save the image to.

We also want to avoid downloading the same image more than once. We'll need to compare the images programmatically to verify that each image is unique.

The easiest way to do this will be to reduce each image to a unique hash value and then ensure that we never have two copies of the same hash value. For the sake of simplicity, we'll use these hash values as the unique ID for each image.
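As a rough illustration of hash-based deduplication, the sketch below hashes raw image bytes and keeps only the first image seen for each hash. SHA-1 is an assumption here: the UIDs in this notebook are 40 hexadecimal characters, which is consistent with SHA-1, but the hash function that `extract.download_image_local` actually uses is not shown. The helper names are hypothetical.

```python
import hashlib


def image_uid(image_bytes):
    """Return a hex digest that identifies the image contents (SHA-1 assumed)."""
    return hashlib.sha1(image_bytes).hexdigest()


def deduplicate(images):
    """Keep the first (name, uid) pair for each distinct content hash.

    `images` is an iterable of (name, raw_bytes) pairs.
    """
    seen = set()
    unique = []
    for name, data in images:
        uid = image_uid(data)
        if uid in seen:
            continue  # Same bytes were already stored under another name.
        seen.add(uid)
        unique.append((name, uid))
    return unique
```

A set gives constant-time membership checks, which matters little for 100 posts but keeps the approach workable for larger crawls.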

In [13]:
# If using Vertex Workbench or a local Jupyter notebook, run this cell.
root_dir = os.getcwd()

In [None]:
# If using Colab (not Colab Enterprise), mount Google Drive
#root_dir = '/content/drive'
#from google.colab import drive
#drive.mount(root_dir)

In [None]:
# If using Colab Enterprise, use gcsfuse
#root_dir = '/content/drive'

#!echo "deb https://packages.cloud.google.com/apt gcsfuse-`lsb_release -c -s` main" | sudo tee /etc/apt/sources.list.d/gcsfuse.list
#!curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
#!apt -qq update && apt -qq install gcsfuse
#!gcsfuse --implicit-dirs $BUCKET_URI $root_dir

#BUCKET_NAME = BUCKET_URI.replace('gs://', '')
#root_dir = '/gcs/content'

#if not os.path.exists(root_dir):
#  os.makedirs(root_dir, exist_ok=True)

#!gcsfuse --implicit-dirs {BUCKET_NAME} {root_dir}

Next, download the images to a mounted drive. Once the image is downloaded, we will read its dimensions and store those dimensions as metadata.

In [14]:
local_reddit_data_dir = f'{root_dir}/reddit_maps_data'

# Start from an empty directory so stale downloads don't linger.
if os.path.exists(local_reddit_data_dir):
    shutil.rmtree(local_reddit_data_dir)

os.mkdir(local_reddit_data_dir)

# Hashes of the images downloaded so far, used to skip duplicates.
hashes = set()

for post in reddit_images:
    url = post.url
    filename = posts.make_nice_filename(post.title)

    if filename == '':
        continue

    path = f'{local_reddit_data_dir}/{filename}'
    uid = extract.download_image_local(url=url, path=path)
    if uid in hashes:
        continue

    hashes.add(uid)

    post.path = path
    post.uid = uid

In [15]:
pprint.pprint(reddit_images[0].to_dict())

{'bboxes': [],
 'path': '/home/jupyter/FantasyMaps/notebooks/final/reddit_maps_data/cliffside_beach_battle_map.30x40.jpg',
 'rid': '1hnjots',
 'title': 'Cliffside Beach 30x40 battle map ',
 'uid': '98cfb16c9250a9926c94a7961eb54eebc658192b',
 'url': 'https://i.redd.it/hufv9eagcf9e1.jpeg',
 'vtt': {'cellHeight': 0,
         'cellWidth': 0,
         'cellsOffsetX': 0,
         'cellsOffsetY': 0,
         'imageHeight': 0,
         'imageWidth': 0}}


### Parse the image for metadata

All we know from the Reddit API is that these posts have JPEG images associated with them and that their titles contain a substring in the format "NNxNN". However, we need more than images and rough column and row counts to train a Vertex AI AutoML image object detection model. We would need more data even to use these images in a virtual tabletop (VTT) app.

To get valid training data and VTT data, we need to make some inferences about the images based upon the data that we have (or can get). The data we have are the number of columns and rows stated in the post's title (granted, these are not always accurate). The data we can get is the image's width and height. From these four data points, and assuming that they are accurate and that all cells in the map are uniform, we can infer the width and height of cells in the map.

Using the cell width and height, we can compute the rest of the data required for both an image object detection model and the data needed for a VTT app.
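The inference itself is simple arithmetic. The sketch below assumes a uniform grid that starts at the top-left corner of the image (so both offsets are zero); the function names and the 2100x2800 pixel size are hypothetical and are not taken from `fantasy_maps_lib` or from a real map.

```python
def infer_cell_size(image_width, image_height, columns, rows):
    """Return (cell_width, cell_height) in pixels for a uniform grid."""
    if columns <= 0 or rows <= 0:
        raise ValueError('columns and rows must be positive')
    return image_width / columns, image_height / rows


def infer_vtt(image_width, image_height, columns, rows):
    """Build a VTT-style dictionary, assuming the grid has no offset."""
    cell_width, cell_height = infer_cell_size(
        image_width, image_height, columns, rows)
    return {
        'imageWidth': image_width,
        'imageHeight': image_height,
        'cellWidth': cell_width,
        'cellHeight': cell_height,
        'cellsOffsetX': 0,  # Assumption: grid starts at the left edge.
        'cellsOffsetY': 0,  # Assumption: grid starts at the top edge.
    }


# A hypothetical 2100x2800 pixel image with a 30x40 grid has 70-pixel cells.
print(infer_cell_size(2100, 2800, 30, 40))
```

If the stated columns and rows are wrong, the computed cell size will not line up with the drawn gridlines, which is why the images are reviewed visually later in this notebook.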

#### Virtual tabletop (VTT) data

The VTT data is the easiest for us to compute. It provides the `cellWidth` and `cellHeight` values that allow us to complete the ML training data.

The JSON structure of VTT data is:

```json
{
    "imageHeight": ##,
    "imageWidth": ##,
    "cellHeight": ##,
    "cellWidth": ##,
    "cellsOffsetY": ##,
    "cellsOffsetX": ##
}
```


#### Image object detection training data

AutoML image object detection on Vertex AI requires a JSONL file with information for the training data. Each line in the JSONL file needs to contain: the Cloud Storage URI of the image; and the bounding boxes of the objects (cells) that we want to train the model to identify on the image.

**Tip**: The format we use for training data resembles the [Common Objects in Context (COCO)](https://cocodataset.org/#format-data) format, with some notable exceptions. The canonical COCO example expresses each box's extent as `width` and `height`, whereas we annotate our data with minimum and maximum `x` and `y` values expressed as a proportion of the total width and height of the image.

The structure of the JSON data in the JSONL file is:

```json
{
    "imageGcsUri": "URI",
    "boundingBoxAnnotations": [
        {
            "displayName": "LABEL_NAME",
            "xMin": ##,
            "xMax": ##,
            "yMin": ##,
            "yMax": ##
        }
    ]
}
```

For each bounding box, we need to provide a percentage value that expresses the vertices of the bounding box as a set of x and y pairs. Also, each bounding box needs to be given a label for that bounding box; all of our bounding boxes are "cells" so each one gets the label `cell`.

**Note**: A training image in Vertex AI can have at most 500 bounding boxes; many fantasy maps have many more than 500 cells. To make the most of the training data, we will split images that are too large into smaller images, or "shards", and use the shards for training. We'll create the shards a bit later.

You can read more about how to format your training manifest JSONL file in [the documentation](https://cloud.google.com/vertex-ai/docs/datasets/prepare-image#object-detection).
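To make the manifest format concrete, here is a minimal sketch that builds one JSONL line for a gridded image. It assumes a uniform grid that covers the whole image with no offset, and it is not the implementation of `extract.compute_bboxes`; the function names and the bucket URI are hypothetical.

```python
import json

MAX_BBOXES = 500  # Limit on bounding boxes per training image.


def cell_bboxes(columns, rows, label='cell'):
    """Return one normalized bounding box per grid cell, row by row."""
    boxes = []
    for row in range(rows):
        for col in range(columns):
            boxes.append({
                'displayName': label,
                'xMin': col / columns,
                'xMax': (col + 1) / columns,
                'yMin': row / rows,
                'yMax': (row + 1) / rows,
            })
    return boxes


def manifest_line(gcs_uri, columns, rows):
    """Serialize one image and its cell annotations as a JSONL line."""
    if columns * rows > MAX_BBOXES:
        raise ValueError('Too many cells; shard the image first.')
    return json.dumps({
        'imageGcsUri': gcs_uri,
        'boundingBoxAnnotations': cell_bboxes(columns, rows),
    })


line = manifest_line('gs://example-bucket/map.2x2.jpg', columns=2, rows=2)
```

Each call produces a single line; writing one line per image to a file yields the JSONL manifest.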

In [16]:
test_img = reddit_images[0]
w, h = extract.get_image_width_and_height(test_img.path)
test_img.width = w
test_img.height = h

# The filename embeds the grid dimensions as ".COLSxROWS.", e.g. ".30x40."
exp = r'\.(\d+)x(\d+)\.'
match = re.search(exp, test_img.path)
if match is None:
    raise ValueError(f'No grid dimensions found in {test_img.path}')

cols, rows = int(match.group(1)), int(match.group(2))
test_img.columns = cols
test_img.rows = rows

print(cols, rows)

30 40


In [17]:
for post in reddit_images:
    if not os.path.exists(post.path):
        continue

    local_path = post.path

    # Get width & height for original image
    w, h = extract.get_image_width_and_height(local_path)

    post.width = w
    post.height = h

    # Get columns & rows for original image, based upon the name.
    match = re.search(r'\.(\d+)x(\d+)\.', local_path)
    if match is None:
        # Skip images whose filename has no dimensions instead of reusing
        # the values left over from the previous iteration.
        continue

    cols, rows = int(match.group(1)), int(match.group(2))
    post.columns = cols
    post.rows = rows

    # Images with at most 500 cells fit in a single training image, so compute
    # their bounding boxes now. Larger images are sharded later.
    if (cols * rows) <= 500:
        post.bboxes = extract.compute_bboxes(img_metadata=post)

In [18]:
#test_images = [p for p in reddit_images if (len(p.bboxes) > 0)]
test_img = reddit_images[0]

#pprint.pprint(test_img)
print(test_img.bboxes)
print(len(reddit_images))
boxes = test_img.bboxes

bboxes = extract.compute_bboxes(img_metadata=test_img)
pprint.pprint(test_img)

[]
34
<fantasy_maps.image.image_metadata.ImageMetadata object at 0x7fac4f9e1d80>


Finally, we are going to make smaller versions of the big(ger) images that we have downloaded. First, we'll define smaller segments of these images. Then, we will create the smaller segments. Finally we will compute the bounding boxes for the resulting smaller images.
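To illustrate what a shard is, the sketch below computes the pixel crop box for a shard of a given number of cells, anchored at a chosen cell, and checks that the box stays inside the parent image. This is a simplified stand-in under the assumption of a uniform grid with no offset. It is not how `shards.compute_shard_coordinates` chooses its shard positions, and the function names and sizes are hypothetical.

```python
def shard_box(cell_width, cell_height, start_col, start_row,
              shard_cols, shard_rows):
    """Return (x_min, y_min, x_max, y_max) in pixels for a shard."""
    x_min = start_col * cell_width
    y_min = start_row * cell_height
    x_max = x_min + shard_cols * cell_width
    y_max = y_min + shard_rows * cell_height
    return x_min, y_min, x_max, y_max


def shard_fits(box, image_width, image_height):
    """Return True if the crop box lies entirely within the parent image."""
    x_min, y_min, x_max, y_max = box
    return (x_min >= 0 and y_min >= 0 and
            x_max <= image_width and y_max <= image_height)


# A 15x15 shard of 70-pixel cells, starting at column 5, row 10.
box = shard_box(70, 70, start_col=5, start_row=10, shard_cols=15, shard_rows=15)
print(box, shard_fits(box, image_width=2100, image_height=2800))
```

A 15x15 shard contains 225 cells, which keeps each shard comfortably under the limit of 500 bounding boxes per image.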

In [19]:
# Only images with more than 500 cells exceed the bounding box limit.
big_images = [p for p in reddit_images if (p.columns * p.rows) > 500]
print(len(big_images))

27


In [None]:
NUM_SHARDS = 3
SHARD_COLS = 15
SHARD_ROWS = 15

for img_metadata in big_images:
    smaller_images = shards.compute_shard_coordinates(img_metadata=img_metadata,
                                                      num_shards=NUM_SHARDS,
                                                      shard_cols=SHARD_COLS,
                                                      shard_rows=SHARD_ROWS)
    for sm_img in smaller_images:
        print(img_metadata.to_dict())
        print(img_metadata.path)
        print(sm_img)
        shard_metadata = shards.create_shard(x_min=sm_img[0],
                                             y_min=sm_img[1],
                                             x_max=sm_img[2],
                                             y_max=sm_img[3],
                                             cols=sm_img[4],
                                             rows=sm_img[5],
                                             parent_img=img_metadata)

        if shard_metadata is None:
            continue

        bboxes = extract.compute_bboxes(img_metadata=shard_metadata)
        shard_metadata.bboxes = bboxes
        reddit_images.append(shard_metadata)

Now that we have computed the VTT and image object detection metadata for the original images and their shards, the `reddit_images` list holds both the parent images and the shards. Let's check how many entries it contains in total.

In [21]:
print(len(reddit_images))

115


## Store the images in Google Cloud Storage

Now that we've created the image shards, we can begin uploading the images to Google Cloud Storage. We'll need to have a Storage bucket already created for this next cell of code.

In [None]:
help(store_image_gcs)

In [22]:
BUCKET = 'fantasy-maps'
for img in reddit_images:
    gcs_uri = store_image_gcs(project_id=PROJECT_ID, img_metadata=img,
                                bucket_name=BUCKET, prefix="FantasyMapsTest")
    img.gcs_uri = gcs_uri

## Store the metadata in Firestore

Next, we're going to store all of this metadata, along with each image's Cloud Storage URI, in Firestore. The benefit of using Firestore is that the nested fields (`vtt` and `bboxes`) are automatically translated into the corresponding document structure when they are written.

As the last storage step, iterate over all of the training data and store the metadata in the Firestore collection.
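For readers curious about what such a write might look like, the following is a minimal sketch and not the implementation of `store_metadata_fs`. It assumes each record is a dictionary with a `uid` key and relies only on the `document()` and `set()` methods of a Firestore collection reference; the function name `store_records` is hypothetical.

```python
def store_records(collection, records):
    """Write each record to `collection`, using its 'uid' as the document ID.

    `collection` is expected to behave like a Firestore CollectionReference.
    Returns the list of document IDs that were written.
    """
    written = []
    for record in records:
        uid = record['uid']
        # set() creates the document, or overwrites an existing document
        # that has the same ID, so re-running the cell is safe.
        collection.document(uid).set(record)
        written.append(uid)
    return written


# Usage with the real client (requires credentials; not executed here):
#   from google.cloud import firestore
#   client = firestore.Client(project=PROJECT_ID)
#   store_records(client.collection(COLLECTION_NAME),
#                 [img.to_dict() for img in reddit_images])
```

Using the image hash as the document ID means that storing the same image twice updates one document instead of creating a duplicate.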

In [23]:
COLLECTION_NAME = 'FantasyMaps2'
for img in reddit_images:
    store_metadata_fs(project_id=PROJECT_ID, img_metadata=img,
                      collection_name=COLLECTION_NAME)


## Check the results of the metadata creation

Now that (hopefully) all of the image metadata has been added to the Firestore collection, we can review the data to ensure that it is correct.

To do this, we'll review the documents stored in the Firestore collection to verify that they have all the data we need: the VTT data, bounding boxes, and the GCS URI of the image.

In [25]:
client = firestore.Client(project=PROJECT_ID)
collection = client.collection(COLLECTION_NAME)

docs = collection.where("bboxes", "!=", "").select(field_paths=[
    "gcs_uri", "filename", "vtt", "parent_uid"]).stream()

With this Firestore query, we can verify the image metadata against the stored image in the Storage bucket. We'll first take the results of this query, compose it into a `pandas.DataFrame` object, and then print it out to the cell output. We can first take a look at the parent map (assuming that the map has been sharded) and then conclude whether the map and all its shards should be removed from the training set.

In [27]:
# Collect one row per document, then build the DataFrame in a single step.
rows = []
for doc in docs:
    row = doc.to_dict()
    row["vtt"] = json.dumps(row["vtt"])
    row["uid"] = doc.id
    rows.append(row)

docs_df = pd.DataFrame(rows)

In [28]:
docs_df.set_index("uid", inplace=True)
check_set_df = docs_df[["filename", "gcs_uri", "parent_uid"]]
check_set_df.head(10)

| uid | filename | gcs_uri | parent_uid |
|---|---|---|---|
| 1eff1c4e52adba942c719fe30daaf820684da683 | nephalia_port_town.32x56.jpg | gs://fantasy-maps/FantasyMapsTest/nephalia_por... | |
| 23af0ee24b932fad8dac020de678cca76021c6c1 | frozen_caverns_free_map_more_hd.57x82.jpg | gs://fantasy-maps/FantasyMapsTest/frozen_caver... | |
| 259cf5bb799baf15790534d41e71c7773f8a0eb7 | tsolenka_pass_iii_curse_strahd_animated.32x18.jpg | gs://fantasy-maps/FantasyMapsTest/tsolenka_pas... | |
| 2936d487ac34d309563e47cf03fca301abc7c8df | mordhin_memorial.40x40.jpg | gs://fantasy-maps/FantasyMapsTest/mordhin_memo... | |
| 38d4408fd12a17e9382d219f7fdb7d1c34b52da1 | tsolenka_pass_iv_curse_strahd_animated.32x18.jpg | gs://fantasy-maps/FantasyMapsTest/tsolenka_pas... | |
| 4653b32134cebfd0f0eadcd848a289ea135e4914 | winter_wonderland_fey_ballroom_night_.38x56.jpg | gs://fantasy-maps/FantasyMapsTest/winter_wonde... | |
| 4f6aa901d0ecd2e01aa08cb240b04204215fdfdf | winter_solstice_feast_vikings_hall_.18x28.jpg | gs://fantasy-maps/FantasyMapsTest/winter_solst... | |
| 4f6ed9f919188bbba8cc344e23dc9a9e796f91bd | tsolenka_pass_v_curse_strahd_animated.32x18.jpg | gs://fantasy-maps/FantasyMapsTest/tsolenka_pas... | |
| 502067f9ecdc34a7047180ff6107c75a796cb53a | forgotten_king_tomb_40x100_battle_map.40x100.jpg | gs://fantasy-maps/FantasyMapsTest/forgotten_ki... | |
| 52bd1dd7b62da83e41f2964fd79687c478b76392 | river_guard_battle_map.30x80.jpg | gs://fantasy-maps/FantasyMapsTest/river_guard_... | |


Not everyone on Reddit follows the same conventions. Sometimes a post mentions dimensions (e.g. "50x40"), but the image doesn't actually have gridlines.

We shouldn't allow these images into the training and test dataset for our model. Unfortunately, we have to review the images that we've collected on GCS and then verify that they do (or don't!) have gridlines visually.


We'll start by printing out the entirety of our `DataFrame`.

In [29]:
pd.set_option("display.max_rows", 1000)
# sort_values returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
check_set_df = check_set_df.sort_values(by="filename", ascending=True)
display(check_set_df)


| uid | filename | gcs_uri | parent_uid |
|---|---|---|---|
| c02113c8c4f50faf141ba53129836c277f421b44 | bandit_camp_ravine_part.0_2240.15x15.22x34.jpg | gs://fantasy-maps/FantasyMapsTest/bandit_camp_... | 04debbbd5aaa4b4e498fcb0e1ceee1b2d542ef40 |
| c53292009a9b6edc4373d7d89cb9ac1cfd3bbc5b | bandit_camp_ravine_part.560_2100.15x15.22x34.jpg | gs://fantasy-maps/FantasyMapsTest/bandit_camp_... | 04debbbd5aaa4b4e498fcb0e1ceee1b2d542ef40 |
| 2c246bc4e3db9ae23e7283b2eff1dd4339d54a81 | bandit_camp_ravine_part.980_1540.15x15.22x34.jpg | gs://fantasy-maps/FantasyMapsTest/bandit_camp_... | 04debbbd5aaa4b4e498fcb0e1ceee1b2d542ef40 |
| 3eae006c7aa9c17f0e4c111d9cb279e408974fd9 | candy_cane_knoll.16x22.jpg | gs://fantasy-maps/FantasyMapsTest/candy_cane_k... | |
| fe49db70b3e37ea4b2b89622a6eb64a213c74bc3 | cliffside_beach_battle_map.200_500.15x15.30x40... | gs://fantasy-maps/FantasyMapsTest/cliffside_be... | 98cfb16c9250a9926c94a7961eb54eebc658192b |
| 98cfb16c9250a9926c94a7961eb54eebc658192b | cliffside_beach_battle_map.30x40.jpg | gs://fantasy-maps/FantasyMapsTest/cliffside_be... | |
| d63c70bea549942acec9376cc1c33de2aef60fb8 | cliffside_beach_battle_map.400_300.15x15.30x40... | gs://fantasy-maps/FantasyMapsTest/cliffside_be... | 98cfb16c9250a9926c94a7961eb54eebc658192b |
| 1efe35d1a7954603b7eb3a4f91499ac71211b4fa | cliffside_beach_battle_map.750_450.15x15.30x40... | gs://fantasy-maps/FantasyMapsTest/cliffside_be... | 98cfb16c9250a9926c94a7961eb54eebc658192b |
| ac307d84b7467bd122dc46393613e5ca99669a35 | crystals.102_380.15x15.30x22.jpg | gs://fantasy-maps/FantasyMapsTest/crystals.102... | c0e58bffc60ba929424b2dbb916bbbe8fe5caa72 |
| 6497da9b7f1c1a9e717aec5806becbc1d53a487e | crystals.1428_190.15x15.30x22.jpg | gs://fantasy-maps/FantasyMapsTest/crystals.142... | c0e58bffc60ba929424b2dbb916bbbe8fe5caa72 |


The final step of data preparation is to mark all of the unusable images in the Firestore collection. Luckily, we can use the Google Cloud Console to view the contents of our Storage bucket. We can even add new fields to the documents in our Firestore collection!

![Storage user interface in the Cloud Console](https://github.com/telpirion/FantasyMaps/blob/main/notebooks/final/resources/StorageUI.png?raw=1)
_Figure. The Google Cloud Storage user interface, showing images in a bucket._

![Firestore user interface in the Cloud Console](https://github.com/telpirion/FantasyMaps/blob/main/notebooks/final/resources/FirestoreUI.png?raw=1)
_Figure. The Cloud Firestore user interface, showing a "Usable" field being added to a document._

For this very last data preparation step, we will visually inspect all of the "parent" images in the Cloud Storage bucket. We will then create a list of unusable images, identified by UID, and a list of usable ones. Finally, we will run a bulk write job against Firestore that deletes the shards of unusable images and sets a `Usable` field to `True` on the shards of usable images.

In [None]:
unusable_parent_images = [
    "d3ee0039aeff5c33de778c5adbdd000f21c0b4cd",
    "9a7e82433239b0087121f6fd31e133f5a94fa7dc",
    "fc70f330cd5a3aaa3be94e0d603dab2876f9fca1",
    "890c0d27318b0286aebf67c392fab286bdc4e7c5",
    "2f4f496466d6e9fb8ad9791ccb81f7c13fd407db",
    "4c0d3eb86ed599f496f7e30e18025021aebfe153",
    "4bf88b8acc669331a65465e8c4b37fd8b9495e4b",
    "b15fe8185c30b3e7800e42e280a3792998a0b55f",
    "5cd7bbd5882fe8b9a270be1aa911c0ba858e818d",
    "67925948371de53b58c09b32c97af60f72c58e0f",
    "a816b6eb31b71cad5b50531d6e18ef46cb451cbd",
    "f00be2ed7b19b8d0f44915a1886437193aa224aa",
    "890c0d27318b0286aebf67c392fab286bdc4e7c5",
    # Others ....
]

usable_parent_images = [
    "7a903be0bad0fc00bfabbefd682b6eef23263b67",
    "b1fe854b9c658cf9312319a447b594743513499b",
    "3a52781685d7aed530a685739b58341fafd2e721",
    "3d17d612a843af7a5ad1a2b2d5dcce29e67d367f",
    "aa33a7ed5c3c87147fd25dafe6c0a1d3eb29dbe5",
    "2ce9f80408137e36531581fb22ee3fe892f41f76",
    "e086c1cf420e27448bc1a45147b5c43df4b3d8d0",
    "5bc994d7d38c1fc4654c58666d674200f731986b",
    "820c3bbfe4d14694ecb729ce3f45b4bda031f61a",
    "2d323018c74db7e0432ff368283ea429f13bd36d",
    "343833ab1d8cd17dae6b702830864500d1e66e19",
    "c60b952d0c20a63dc263220ec5b49a54fd20d175",
    "8ac01d3a84bc548a8b243d08c6e031206f293908",
    "f00be2ed7b19b8d0f44915a1886437193aa224aa",
    "950390c88b7bd9d6f886e5f01bae9460c0aa407b",
    "3322c1b4795e4a9e4477feafd55685d647b4e29c",
    "34362be27ada680a58e42d26758bf08c01d3460a",
    "60d815ba2c1458c3a4039595a5ed723a7501a36e",
    "928b339ba222a3933fce4523f8033fa3eb7ed62f",
    "4931f0033f0ab217fd0fe2a2024d22a119782e2c",
    "83846604933daff160cefc28c2a828bd93a84e1a",
    "d2fe7281d8b1e043009033c957ca347847343e14",
    "c1552a0046ef4ece5d146544043edb2deb97f7a1",
    # And more ...
]

**Note**: If you're thinking that this manual process should be automatable--you're right! In another notebook, we will use a pre-trained version of our gridline-detecting model to accept or reject images.
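Because these lists are maintained by hand, it may be worth checking them before writing to Firestore. In the lists above, one UID (`f00be2ed7b19b8d0f44915a1886437193aa224aa`) appears in both lists, and another (`890c0d27318b0286aebf67c392fab286bdc4e7c5`) is listed twice as unusable. The sketch below shows one way to surface such entries; the helper names are hypothetical and the sample UIDs are shortened placeholders.

```python
def find_conflicts(usable, unusable):
    """Return UIDs that appear in both lists, sorted for stable output."""
    return sorted(set(usable) & set(unusable))


def find_duplicates(uids):
    """Return UIDs that are listed more than once, in first-seen order."""
    seen = set()
    duplicates = []
    for uid in uids:
        if uid in seen and uid not in duplicates:
            duplicates.append(uid)
        seen.add(uid)
    return duplicates


# Shortened placeholder UIDs, for illustration only.
sample_usable = ['aaa', 'bbb', 'ccc']
sample_unusable = ['ddd', 'ccc', 'ddd']
print(find_conflicts(sample_usable, sample_unusable))
print(find_duplicates(sample_unusable))
```

Converting the lists to sets removes exact duplicates automatically, but a UID that sits in both sets has to be resolved by looking at the image again.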

In [None]:
usable_set = set(usable_parent_images)
unusable_set = set(unusable_parent_images)

In [None]:
firestore_client = firestore.Client(project=PROJECT_ID)
bulkwriter = firestore_client.bulk_writer()
collection = firestore_client.collection(COLLECTION_NAME)

In [None]:
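# Delete the metadata entries for shards whose parent image is unusable.
# Firestore limits the number of values in an "in" filter, so query in batches of 10.
# The parent's UID is stored in the "parent_uid" field queried earlier.
unusable_uids = list(unusable_set)
for start in range(0, len(unusable_uids), 10):
    subset = unusable_uids[start:start + 10]
    unusable_shards = collection.where("parent_uid", "in", subset).stream()
    for doc in unusable_shards:
        bulkwriter.delete(doc.reference)

bulkwriter.flush()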

In [None]:
# Mark the shards of every usable parent image, again in batches of 10.
# The parent's UID is stored in the "parent_uid" field queried earlier.
usable_uids = list(usable_set)
for start in range(0, len(usable_uids), 10):
    subset = usable_uids[start:start + 10]
    usable_shards = collection.where("parent_uid", "in", subset).stream()

    for doc in usable_shards:
        bulkwriter.update(doc.reference, {"Usable": True})

bulkwriter.flush()

In [None]:
bulkwriter.close()