# Fantasy maps Reddit scraper

This notebook pulls gridded map images from several subreddits and stores them in a Google Cloud Storage bucket.

## START HERE: Set up the GCP environment

In [4]:
! pip install --user google-cloud-secret-manager google-cloud-storage google-cloud-aiplatform==1.3.0 praw pandas numpy

Collecting google-cloud-secret-manager
  Downloading google_cloud_secret_manager-2.9.2-py2.py3-none-any.whl (97 kB)
     |████████████████████████████████| 97 kB 4.1 MB/s             
Collecting praw
  Downloading praw-7.5.0-py3-none-any.whl (176 kB)
     |████████████████████████████████| 176 kB 31.8 MB/s            
Collecting google-api-core[grpc]<3.0.0dev,>=1.26.0
  Downloading google_api_core-2.7.1-py3-none-any.whl (114 kB)
     |████████████████████████████████| 114 kB 59.4 MB/s            
Collecting prawcore<3,>=2.1
  Downloading prawcore-2.3.0-py3-none-any.whl (16 kB)
Collecting update-checker>=0.18
  Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Installing collected packages: google-api-core, update-checker, prawcore, praw, google-cloud-secret-manager
  Attempting uninstall: google-api-core
    Found existing installation: google-api-core 1.31.4
    Uninstalling google-api-core-1.31.4:
      Successfully uninstalled google-api-core-1.31.4
ERROR: pip's depen

Set the `PROJECT_ID` variable from the environment.

In [2]:
import os

PROJECT_ID = !gcloud config get-value project
PROJECT_ID = PROJECT_ID[0]
print(PROJECT_ID)


fantasymaps-334622
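
The `!gcloud` syntax only works inside IPython or Jupyter. Below is a minimal sketch of a plain-Python alternative. It assumes that the `GOOGLE_CLOUD_PROJECT` environment variable may already be set, and that the `gcloud` CLI is on the path otherwise. `get_project_id` is a hypothetical helper, not part of this notebook.

```python
import os
import subprocess


def get_project_id(environ=None):
    """Return the GCP project ID from the environment, else from gcloud."""
    environ = os.environ if environ is None else environ
    project_id = environ.get("GOOGLE_CLOUD_PROJECT")
    if project_id:
        return project_id
    # Fall back to the same command that the cell above runs with `!`.
    result = subprocess.run(
        ["gcloud", "config", "get-value", "project"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Passing a dictionary as `environ` makes the helper easy to exercise without touching the real environment.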


## Get the Reddit API key out of Secret Manager

In [9]:
from google.cloud import secretmanager
import json

client = secretmanager.SecretManagerServiceClient()

secret_resource_name = f"projects/{PROJECT_ID}/secrets/reddit-api-key/versions/1"
response = client.access_secret_version(request={"name": secret_resource_name})

payload = response.payload.data.decode("UTF-8")
reddit_key_json = json.loads(payload)

# Print only the key names; never echo secret values into notebook output.
print(sorted(reddit_key_json))

['client_id', 'secret', 'user_agent', 'user_name']
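
The later cells read `client_id`, `secret`, and `user_agent` from the decoded payload. Below is a minimal sketch of checking for those fields before connecting to Reddit. `parse_reddit_key` is a hypothetical helper, and the sample payload is made up for illustration.

```python
import json

# Fields that the Reddit connection cell reads from the secret.
REQUIRED_KEYS = ("client_id", "secret", "user_agent")


def parse_reddit_key(payload):
    """Decode the secret payload and fail early if a field is missing."""
    key = json.loads(payload)
    missing = [k for k in REQUIRED_KEYS if k not in key]
    if missing:
        raise KeyError(f"Reddit key is missing fields: {missing}")
    return key


# Made-up payload, standing in for response.payload.data.decode("UTF-8").
sample_payload = '{"client_id": "abc", "secret": "xyz", "user_agent": "script:demo:v0"}'
reddit_key = parse_reddit_key(sample_payload)
print(sorted(reddit_key))
```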


## Get a batch of images from a subreddit

First, connect to Reddit.

In [10]:
import praw
import numpy as np
import pandas as pd

reddit = praw.Reddit(client_id=reddit_key_json['client_id'], 
                     client_secret=reddit_key_json['secret'],
                     user_agent=reddit_key_json['user_agent'])
print(f'Reddit is in read-only mode: {reddit.read_only}')

Reddit is in read-only mode: True


Next, download up to 1000 "hot" posts from one of the subreddits.

In [11]:
nan_value = float("NaN")
subreddit_name = 'fantasymaps'
posts = reddit.subreddit(subreddit_name).hot(limit=1000)

# Keep only the fields needed later: title, self-text, post ID, and link URL.
filtered_posts = [[s.title, s.selftext, s.id, s.url] for s in posts]
reddit_posts_df = pd.DataFrame(filtered_posts,
                               columns=['Title', 'Post', 'ID', 'URL'])

reddit_posts_df.head(10)

|  | Title | Post | ID | URL |
| --- | --- | --- | --- | --- |
| 0 | [meta] Zero Tolerance on Shitty Behavior | Heads up folks. I have a zero tolerance policy... | fp0po8 | https://www.reddit.com/r/FantasyMaps/comments/... |
| 1 | Sub Updates and Adjustments | Hello Everyone!\n\n**First, I want to thank yo... | s26g1s | https://www.reddit.com/r/FantasyMaps/comments/... |
| 2 | How could I improve my map? |  | tttznt | https://i.redd.it/rjlwtsr6vxq81.jpg |
| 3 | Cave Mine [40x40] [Battlemap] |  | ttp1dc | https://i.redd.it/qu7yexqvqwq81.jpg |
| 4 | Ice dungeon [40x55] |  | tto5f6 | https://i.redd.it/pisgx1trhwq81.jpg |
| 5 | Chapel [32x44] |  | ttl8t5 | https://i.redd.it/e2bw1tf4gvq81.jpg |
| 6 | Multiplanar battle. Ritual gone wrong. |  | ttsk0z | https://www.reddit.com/gallery/ttsk0z |
| 7 | World Map of El'kora (No Markers, Free to Cust... |  | tts2mv | https://www.reddit.com/gallery/tts2mv |
| 8 | Adding more adventure features to my hexagons |  | ttnymu | https://i.redd.it/p6ti380qfwq81.png |
| 9 | A map I've been working on for a little bit! s... |  | tty9yl | https://i.redd.it/n6zura2rqyq81.png |


Filter for only posts that contain JPG images.

In [12]:
jpg_df = reddit_posts_df.loc[reddit_posts_df["URL"].str.contains("jpg")]
jpg_df.head(10)
#print(jpg_df.shape)

|  | Title | Post | ID | URL |
| --- | --- | --- | --- | --- |
| 2 | How could I improve my map? |  | tttznt | https://i.redd.it/rjlwtsr6vxq81.jpg |
| 3 | Cave Mine [40x40] [Battlemap] |  | ttp1dc | https://i.redd.it/qu7yexqvqwq81.jpg |
| 4 | Ice dungeon [40x55] |  | tto5f6 | https://i.redd.it/pisgx1trhwq81.jpg |
| 5 | Chapel [32x44] |  | ttl8t5 | https://i.redd.it/e2bw1tf4gvq81.jpg |
| 10 | [Battlemap] Moonlit Giant Throne Room (36x50) |  | tu1esg | https://i.redd.it/nh9nwinkezq81.jpg |
| 11 | I hear this sub likes watermarks [30x30] |  | ttny8f | https://i.imgur.com/L7OJQXU.jpg |
| 12 | Trenches [Battlemap][3072x3072px] |  | ttuag2 | https://i.redd.it/u7ffum0hxxq81.jpg |
| 13 | A city fit for a Pirate King! Shipwreck City [... |  | tt5hfu | https://i.redd.it/raq29tp2crq81.jpg |
| 14 | [2100x3780] [30x54] Dragon Meeting Temple [Bat... |  | ttwrrj | https://i.redd.it/9q8a8rkifyq81.jpg |
| 17 | [25x35] Acient Ruins [Forest][Battlemap][Ruins] |  | tsz644 | https://i.redd.it/b9t0vo9uupq81.jpg |
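
A substring match on "jpg" accepts any URL that contains those letters anywhere, including in a query string or a post ID. Below is a sketch of a stricter check that looks only at the file extension of the URL path. `is_jpeg_url` is a hypothetical helper, and the sample URLs are taken from the listing above.

```python
import os.path
from urllib.parse import urlparse


def is_jpeg_url(url):
    """Return True when the URL path ends in .jpg or .jpeg (any case)."""
    path = urlparse(url).path
    extension = os.path.splitext(path)[1].lower()
    return extension in (".jpg", ".jpeg")


urls = [
    "https://i.redd.it/rjlwtsr6vxq81.jpg",
    "https://www.reddit.com/gallery/ttsk0z",
    "https://i.redd.it/p6ti380qfwq81.png",
]
jpeg_urls = [u for u in urls if is_jpeg_url(u)]
print(jpeg_urls)
```

With the DataFrame from this notebook, the same check could be applied as `reddit_posts_df[reddit_posts_df["URL"].map(is_jpeg_url)]`.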


Download just the first image as a test.

In [13]:
import os
import re
import requests
import shutil

# Replace whitespace, pipes, parentheses, and double quotes with underscores.
# A raw string avoids the invalid "\s" escape warning in a normal string.
regex = r'[\s|()"]'

test_image = jpg_df.head(1).URL.item()
test_file_name = jpg_df.head(1).Title.item()
test_file_name = re.sub(regex, "_", test_file_name)
test_file_name = test_file_name.lower()[:100]

print(test_image)
print(test_file_name)

https://i.redd.it/rjlwtsr6vxq81.jpg
how_could_i_improve_my_map?


In [14]:
r = requests.get(test_image, stream=True)
if r.status_code == 200:
    r.raw.decode_content = True
    with open(test_file_name, 'wb') as f:
        shutil.copyfileobj(r.raw, f)

In [15]:
def make_nice_filename(name):
    """Create a file name from a post title.

    Replaces whitespace, pipes, parentheses, and double quotes with
    underscores, lowercases the result, and truncates it to 30 characters.

    TODO(telpirion):
       + Condense multiple underscores to single underscores
       + Add brackets [] to regex
    """
    regex = r'[\s|()"]'
    new_name = re.sub(regex, "_", name)
    new_name = new_name.lower()[:30]
    return f"{new_name}.jpg"
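
The TODO items in the docstring above could be addressed roughly as follows. This is a sketch under assumptions, not the notebook's implementation: `make_nicer_filename` is a hypothetical name, and trimming leading and trailing underscores is an extra step that the TODO list does not mention.

```python
import re


def make_nicer_filename(name, max_length=30):
    """Sketch of a file-name cleaner that also handles brackets."""
    # Replace whitespace, pipes, parentheses, double quotes, and brackets.
    new_name = re.sub(r'[\s|()"\[\]]', "_", name)
    # Condense runs of underscores, then trim them from both ends.
    new_name = re.sub(r"_+", "_", new_name).strip("_")
    new_name = new_name.lower()[:max_length]
    return f"{new_name}.jpg"


print(make_nicer_filename("Cave Mine [40x40] [Battlemap]"))
# → cave_mine_40x40_battlemap.jpg
```

Characters such as `?` are left untouched here, as in the original function.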

Download the first 50 images and put them into a local directory.

In [16]:
local_reddit_data_dir = "reddit_maps_data"

# Create the download directory if it doesn't already exist.
os.makedirs(local_reddit_data_dir, exist_ok=True)

for index, row in jpg_df.head(50).iterrows():
    image_url = row["URL"]
    image_filename = make_nice_filename(row["Title"])

    # Stream the image to disk; posts that don't return HTTP 200 are skipped.
    r = requests.get(image_url, stream=True)
    if r.status_code == 200:
        r.raw.decode_content = True
        with open(f"{local_reddit_data_dir}/{image_filename}", "wb") as f:
            shutil.copyfileobj(r.raw, f)
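
Because titles are truncated, two posts can map to the same file name, and the later download would then overwrite the earlier one. One way to avoid that, sketched below, is to append the Reddit post ID (the `ID` column) to the file name. `unique_filename` is a hypothetical helper.

```python
def unique_filename(filename, post_id):
    """Insert the post ID before the file extension, if there is one."""
    stem, dot, extension = filename.rpartition(".")
    if not dot:
        # No extension found: append the ID at the end.
        return f"{filename}_{post_id}"
    return f"{stem}_{post_id}.{extension}"


print(unique_filename("chapel_[32x44].jpg", "ttl8t5"))
# → chapel_[32x44]_ttl8t5.jpg
```

In the loop above, this could be called as `unique_filename(make_nice_filename(row["Title"]), row["ID"])`.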

## Review downloaded images for prediction

The images downloaded from Reddit contain a mix of gridded images and ungridded images. Unfortunately, the easiest way to determine which are gridded (and which aren't) is through manual inspection.

The "image_reviewer.ipynb" notebook can be used to review images. Once the set of images for batch prediction is ready, create a list of the images to upload.

In [119]:
"""
gridded_images = [
    "reddit_maps_data/[oc]_rocky_ruins_battlemap[48x48].jpg",
    "reddit_maps_data/48__x_48__text_to_map_prototype_of_a_jungle_road_&_camp.jpg",
    "reddit_maps_data/airship's_crash_site_[32x46][desert].jpg",
    "reddit_maps_data/canal_city_battle_map_30x30.jpg",
    "reddit_maps_data/city_drawbridge_[32x44].jpg",
    "reddit_maps_data/crypt_of_the_hellriders.jpg",
    "reddit_maps_data/desert_oasis_town_battlemap_[36x30].jpg",
    "reddit_maps_data/gridlock_vault_[part_03]_[33x20].jpg",
    "reddit_maps_data/just_a_small_clearing_filled_with_the_energies_of_a_long-forgotten_water_goddess_[oc][art][battle_ma.jpg",
    "reddit_maps_data/mountain_outpost_[battlemap][oc][22x33][1540x2310].jpg",
    "reddit_maps_data/nest_[25x30].jpg",
    "reddit_maps_data/tunnel_cave.jpg",
    "reddit_maps_data/what_do_you_mean?_did_you_pay_30_gold_for_a_bridge?_what_bridge_costs_30_gold?_small__bridge__encoun.jpg",
]
"""

# Take only the files at the top level of the download directory.
_, _, dataset_files = next(os.walk('reddit_maps_data'))
dataset_files = sorted(dataset_files)

#print(dataset_files)

gridded_images = [f"reddit_maps_data/{i}" for i in dataset_files]
print(gridded_images)

['reddit_maps_data/48__x_48__text_to_map_prototype_of_a_jungle_road_&_camp.jpg', 'reddit_maps_data/[22x17]_wild_rapids_[battlemap.jpg', 'reddit_maps_data/[24x36]_teleporter_portals_[fo.jpg', 'reddit_maps_data/[25_x_41]_[6000_x_10000px]_fey.jpg', 'reddit_maps_data/[25x30]_nest_[cave][battlemap].jpg', 'reddit_maps_data/[26x39]_fireflies_at_the_cross.jpg', 'reddit_maps_data/[36x30]_desert_oasis_battlemap.jpg', 'reddit_maps_data/[battlemap][30x30][2160x2160px.jpg', 'reddit_maps_data/[battlemap]_the_hanging_tree_[.jpg', 'reddit_maps_data/[battlemap]_the_river_--_[oc]_.jpg', 'reddit_maps_data/[battlemap]_town_infirmary_[25.jpg', 'reddit_maps_data/[battlemap]_underwater_temple_.jpg', "reddit_maps_data/[oc]_airship's_crash_site_[32x.jpg", 'reddit_maps_data/[oc]_red_rocks_[battlemap][53x.jpg', 'reddit_maps_data/[oc]_rocky_ruins_[battlemap][4.jpg', "reddit_maps_data/airship's_crash_site_[32x46][desert].jpg", 'reddit_maps_data/city_drawbridge_[32x44].jpg', 'reddit_maps_data/city_port_docks_[55x40

## Create the batch prediction job

In [120]:
import os
from datetime import datetime

from google.cloud import aiplatform as aip
from google.cloud import storage

aip.init(project=PROJECT_ID, location="us-central1")
storage_client = storage.Client(project=PROJECT_ID)

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
BATCH_PREDICTION_BUCKET = "video-erschmid"
BATCH_PREDICTION_PREFIX = f"DnD-batch-predict-input/{TIMESTAMP}"
BATCH_PREDICTION_URI = f"gs://{BATCH_PREDICTION_BUCKET}/{BATCH_PREDICTION_PREFIX}"
MODEL_ID = "5982671465147793408"

os.environ['GOOGLE_CLOUD_PROJECT'] = PROJECT_ID
input_file_uri = f"{BATCH_PREDICTION_PREFIX}/batch-prediction-input.jsonl"

print(BATCH_PREDICTION_BUCKET)
print(PROJECT_ID)
print(input_file_uri)

video-erschmid
video-erschmid
DnD-batch-predict-input/20211005225454/batch-prediction-input.jsonl


### Upload local files to the bucket

In [121]:
bucket = storage_client.bucket(BATCH_PREDICTION_BUCKET)

for gridded_image in gridded_images:
    filename = gridded_image.split("/")[-1]
    print(filename)
    file_blob = bucket.blob(f"{BATCH_PREDICTION_PREFIX}/{filename}")
    file_blob.upload_from_filename(gridded_image)

48__x_48__text_to_map_prototype_of_a_jungle_road_&_camp.jpg
[22x17]_wild_rapids_[battlemap.jpg
[24x36]_teleporter_portals_[fo.jpg
[25_x_41]_[6000_x_10000px]_fey.jpg
[25x30]_nest_[cave][battlemap].jpg
[26x39]_fireflies_at_the_cross.jpg
[36x30]_desert_oasis_battlemap.jpg
[battlemap][30x30][2160x2160px.jpg
[battlemap]_the_hanging_tree_[.jpg
[battlemap]_the_river_--_[oc]_.jpg
[battlemap]_town_infirmary_[25.jpg
[battlemap]_underwater_temple_.jpg
[oc]_airship's_crash_site_[32x.jpg
[oc]_red_rocks_[battlemap][53x.jpg
[oc]_rocky_ruins_[battlemap][4.jpg
airship's_crash_site_[32x46][desert].jpg
city_drawbridge_[32x44].jpg
city_port_docks_[55x40][oc][ba.jpg
desert_oasis_bazaar_battlemap_.jpg
desert_oasis_town_battlemap_[3.jpg
free_map_friday_from_grim_pres.jpg
gridlock_vault_[part_03]_[33x20].jpg
halls_of_the_spider_queen_[bat.jpg
just_a_small_clearing_filled_with_the_energies_of_a_long-forgotten_water_goddess_[oc][art][battle_ma.jpg
mountain_outpost_[battlemap][oc][22x33][1540x2310].jpg
tartarus_

### Create the batch prediction input file

In [122]:
import json

# List only the blobs uploaded under this run's prefix.
blobs = bucket.list_blobs(prefix=BATCH_PREDICTION_PREFIX)
input_file_data = []

for blob in blobs:
    if blob.name.endswith(".jpg"):
        print(blob.name)

        # Add the data to store in the JSONL input file.
        tmp_data = {"content": f"gs://{BATCH_PREDICTION_BUCKET}/{blob.name}", "mimeType": "image/jpeg"}
        input_file_data.append(tmp_data)

# Serialize each record as one JSON object per line (JSONL).
# str(d) would emit single-quoted Python repr, which is not valid JSON.
input_str = "\n".join(json.dumps(d) for d in input_file_data)
file_blob = bucket.blob(input_file_uri)
file_blob.upload_from_string(input_str)

DnD-batch-predict-input/20211005225454/48__x_48__text_to_map_prototype_of_a_jungle_road_&_camp.jpg
DnD-batch-predict-input/20211005225454/[22x17]_wild_rapids_[battlemap.jpg
DnD-batch-predict-input/20211005225454/[24x36]_teleporter_portals_[fo.jpg
DnD-batch-predict-input/20211005225454/[25_x_41]_[6000_x_10000px]_fey.jpg
DnD-batch-predict-input/20211005225454/[25x30]_nest_[cave][battlemap].jpg
DnD-batch-predict-input/20211005225454/[26x39]_fireflies_at_the_cross.jpg
DnD-batch-predict-input/20211005225454/[36x30]_desert_oasis_battlemap.jpg
DnD-batch-predict-input/20211005225454/[battlemap][30x30][2160x2160px.jpg
DnD-batch-predict-input/20211005225454/[battlemap]_the_hanging_tree_[.jpg
DnD-batch-predict-input/20211005225454/[battlemap]_the_river_--_[oc]_.jpg
DnD-batch-predict-input/20211005225454/[battlemap]_town_infirmary_[25.jpg
DnD-batch-predict-input/20211005225454/[battlemap]_underwater_temple_.jpg
DnD-batch-predict-input/20211005225454/[oc]_airship's_crash_site_[32x.jpg
DnD-batch-pre
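
Each line of a JSONL file has to be a valid JSON document, which means double-quoted keys and strings. The sketch below builds such lines with `json.dumps` and checks that they parse back to the same records. The bucket and file names are hypothetical.

```python
import json

# Hypothetical records in the same shape as the input file entries.
records = [
    {"content": "gs://example-bucket/maps/nest.jpg", "mimeType": "image/jpeg"},
    {"content": "gs://example-bucket/maps/airship's_crash.jpg", "mimeType": "image/jpeg"},
]

# One JSON object per line.
jsonl = "\n".join(json.dumps(record) for record in records)

# Every line parses back to the original record.
parsed = [json.loads(line) for line in jsonl.splitlines()]

# By contrast, the Python repr of a dict uses single quotes and is rejected.
try:
    json.loads(str(records[0]))
    repr_is_valid_json = True
except json.JSONDecodeError:
    repr_is_valid_json = False

print(parsed == records, repr_is_valid_json)
```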

### Create the batch prediction job

In [123]:
from google.cloud.aiplatform import jobs

job_display_name = f"maps-batch-predict-{TIMESTAMP}"
model = aip.Model(model_name=f"projects/{PROJECT_ID}/locations/us-central1/models/{MODEL_ID}")

batch_prediction_job = model.batch_predict(
    job_display_name=job_display_name,
    # Use the input file created above for this run's timestamp.
    gcs_source=f"gs://{BATCH_PREDICTION_BUCKET}/{input_file_uri}",
    gcs_destination_prefix=f"gs://{BATCH_PREDICTION_BUCKET}/{BATCH_PREDICTION_PREFIX}/output",
    sync=True,
)

INFO:google.cloud.aiplatform.jobs:Creating BatchPredictionJob
INFO:google.cloud.aiplatform.jobs:BatchPredictionJob created. Resource name: projects/147301782967/locations/us-central1/batchPredictionJobs/9196449395137052672
INFO:google.cloud.aiplatform.jobs:To use this BatchPredictionJob in another session:
INFO:google.cloud.aiplatform.jobs:bpj = aiplatform.BatchPredictionJob('projects/147301782967/locations/us-central1/batchPredictionJobs/9196449395137052672')
INFO:google.cloud.aiplatform.jobs:View Batch Prediction Job:
https://console.cloud.google.com/ai/platform/locations/us-central1/batch-predictions/9196449395137052672?project=147301782967
INFO:google.cloud.aiplatform.jobs:BatchPredictionJob projects/147301782967/locations/us-central1/batchPredictionJobs/9196449395137052672 current state:
JobState.JOB_STATE_RUNNING
INFO:google.cloud.aiplatform.jobs:BatchPredictionJob projects/147301782967/locations/us-central1/batchPredictionJobs/9196449395137052672 current state:
JobState.JOB_STAT