Skip to content
This repository has been archived by the owner on Apr 14, 2023. It is now read-only.

Add images from the Toronto Public Library #24

Merged
merged 5 commits into from
Jun 4, 2018
Merged

Conversation

danvk
Copy link
Contributor

@danvk danvk commented Jun 1, 2018

This adds 3,991 images from the Toronto Public Library to OldTO. It excludes images from the Toronto Star, which have a more restrictive copyright. (Including these would get us 3-4x more images.)

We use the same geocoding techniques as for the Toronto Archives and run a separate pipeline to generate another images.geojson file. The final step merges data/toronto-archives/images.geojson + data/tpl/images.geojsondata/images.geojson.

Changes:

  • Move Toronto Archives-specific data into data/toronto-archives
  • Add Toronto Public Library data to data/tpl.
  • Add some source-specific logic to generate_geocodes.py
  • Add some custom UI code to properly attribute images.
  • Add a timeout to geocoding (some of the regexes have pathological behavior for image titles with lots of capital letters in them).
  • Fix scripts/check_data.sh; I'm not sure if it ever worked before.

There are no diffs between data/images.geojson on master and data/toronto-archives/images.geojson on this branch. The changes are entirely to add new images from the TPL.

@DOsinga There's now a tpl_fields entry in feature properties for some images. I'm not sure if this will affect the API server database. I only tested with the dev server.
cc @mebreuer

TBD: do we want to host these out of GCS? The medium-size versions are in, e.g. gs://sidewalk-old-toronto/toronto-public-library/DC-964-6-43.jpg.

TPL image:
image

Toronto Archives image:
image

return (
'TSPA' in record['license'] or
'-TS-' in record['uniqueID'] or
'Toronto Star' in (record['provenance'] or []) or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the record contain Nones or can we get away with .get(...)
either way you could collapse the last two lines with an or

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It contains Nones. Not sure I follow the comment about the or. I added a comment.

num_star += 1
continue
num_output += 1
print(row, end='')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you mean to leave this print statement in?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. If I take it out, the program won't do anything! (It's a line filter.)


SOURCE_TPL = 'tpl'
SOURCE_ARCHIVES = 'toronto-archives'
SOURCES = {SOURCE_TPL, SOURCE_ARCHIVES}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd prefer all constants on top of the file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -155,7 +216,7 @@ def load_patch_csv(patch_csv):

image_url = record.get('imageLink')
assert image_url
dims = path_to_size.get(os.path.basename(image_url))
dims = path_to_size.get(os.path.basename(image_url)) or path_to_size.get(id_ + '.jpg')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make the second clause of the or be the default for the first get?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.



if __name__ == '__main__':
urls_file_input, ndjson_output = sys.argv[1:]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here and above, it might be overkill, but I'd consider using argparse

@danvk danvk merged commit 13c1352 into master Jun 4, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants