# Data Ingestion of OpenDataPhilly Complete Street Centerlines

Add the raw data to the project and manually create metadata for the upstream source.

Each dataset gets its own research object so that it can be tracked individually, without creating one very large, interconnected research object for the whole project.

## Dataset Descriptions

Use the attributes and entities below as needed.

- Attributes
    - name
    - publisher {"@id": "#local_id"} or {"@id": "url"}
    - creator {"@id": "#person"}
    - license
    - datePublished
    - keywords (ex. "streets, aggregated features, map")
    - mainEntity Output Dataset 
- Data Entities
    - File
        - source (File(s))
        - description
        - contentSize (MB, KB, or B)
        - encodingFormat (from PRONOM: https://www.nationalarchives.gov.uk/PRONOM)
        - sha256
    - Dataset
        - source (File(s))
        - description
        - contentSize (MB, KB, or B)
        - encodingFormat (from PRONOM: https://www.nationalarchives.gov.uk/PRONOM)
    - ComputationalWorkflow
        - @type (pre-defined with built-in models) ["File", "SoftwareSourceCode", "ComputationalWorkflow"]
        - source (File)
        - author
        - programmingLanguage
        - input [{"@id": "#id1"}, {"@id": "#id2"}]
        - output [{"@id": "#id3"}]
- Contextual Entities
    - Publisher
        - @type Organization
        - name
        - url
    - Person
        - name
    - programmingLanguage
        - @type ["ComputerLanguage", "SoftwareApplication"]
        - name
        - version
        - url
    - encodingFormat
        - @id PRONOM url
        - @type Website
        - name
    - input
        - @type FormalParameter
        - name
        - defaultValue {"@id": "#ReferenceToUpstreamDataset"}
        - encodingFormat
        - valueRequired True
    - output
        - @type FormalParameter
        - name
        - defaultValue {"@id": "#ReferenceToDownstreamDataset"}
        - encodingFormat
        - valueRequired True
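
The `input` and `output` entries of a ComputationalWorkflow listed above can be pictured as plain JSON-LD. The following is a minimal sketch using only the standard library, not the rocrate API; the ids `#workflow.py`, `#streets_in`, and `#streets_out` and the helper `formal_parameter` are hypothetical names chosen for illustration.

```python
import json

# PRONOM identifier for GeoJSON, the same one used later in this notebook.
GEOJSON = "https://www.nationalarchives.gov.uk/PRONOM/fmt/1367"


def formal_parameter(param_id, name, dataset_id, format_id):
    """Build a FormalParameter contextual entity as a plain dict (hypothetical helper)."""
    return {
        "@id": param_id,
        "@type": "FormalParameter",
        "name": name,
        "defaultValue": {"@id": dataset_id},
        "encodingFormat": {"@id": format_id},
        "valueRequired": True,
    }


# Hypothetical parameter ids; the dataset references are the placeholders from the list above.
param_in = formal_parameter(
    "#streets_in", "streets_in", "#ReferenceToUpstreamDataset", GEOJSON
)
param_out = formal_parameter(
    "#streets_out", "streets_out", "#ReferenceToDownstreamDataset", GEOJSON
)

# The workflow points at its parameters by @id only.
workflow = {
    "@id": "#workflow.py",
    "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
    "input": [{"@id": param_in["@id"]}],
    "output": [{"@id": param_out["@id"]}],
}

print(json.dumps([workflow, param_in, param_out], indent=2))
```

Keeping the parameters as separate entities and linking them by `@id` is what lets an upstream dataset be referenced from several workflows without duplicating its description.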

In [1]:
from datetime import date
from rocrate.rocrate import ROCrate
from rocrate.model import (
    Person,
    File,
    Dataset,
    ComputationalWorkflow,
    ContextEntity,
)

In [2]:
crate = ROCrate()

### Main Attributes

- name
- publisher {"@id": "#local_id"} or {"@id": "url"}
- creator {"@id": "#person"}
- license
- datePublished
- keywords (ex. "streets, aggregated features, map")
- mainEntity Output Dataset

Contextual

- Publisher
    - @type Organization
    - name
    - url
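
The two reference forms for `publisher`, a local `#` id or a URL, can be told apart mechanically. This is a small sketch under the assumption that only `http` and `https` identifiers count as URLs; `as_reference` is a hypothetical helper, not part of rocrate.

```python
from urllib.parse import urlparse


def as_reference(identifier):
    """Return a JSON-LD reference, prefixing local identifiers with '#'."""
    # Assumption: only http(s) identifiers are treated as external URLs.
    if urlparse(identifier).scheme in ("http", "https"):
        return {"@id": identifier}
    if identifier.startswith("#"):
        return {"@id": identifier}
    return {"@id": f"#{identifier}"}


print(as_reference("Department of Planning and Development"))
print(as_reference("https://www.phila.gov/terms-of-use/"))
```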

In [3]:
name = "Philadelphia Street Centerlines"
short_name = "OpenDataPhillyStreets"

In [4]:
dept_planning = crate.add(
    ContextEntity(
        crate,
        "Department of Planning and Development",
        properties={
            "@type": "Organization",
            "name": "Philadelphia, Pennsylvania Department of Planning and Development",
            "url": "https://www.phila.gov/departments/department-of-planning-and-development/"
        },
    )
)

In [5]:
philly_license = crate.add(
    ContextEntity(
        crate,
        "Philadelphia Data Terms Of Use",
        properties={
            "@type": "License",
            "name": "Philadelphia Data Terms Of Use",
            "url": "https://www.phila.gov/terms-of-use/",
        },
    )
)

In [6]:
date_published = date(2017, 3, 1).isoformat()

In [7]:
crate.name = name
crate.publisher = dept_planning
crate.creator = dept_planning
crate.license = philly_license
crate.datePublished = date_published
crate.keywords = [
    "OpenDataPhilly",
    "Philadelphia City Planning Commission",
    "Streets",
    "Geometry",
    "Ingested Data",
    "Raw",
]

### Dataset

- source (File(s))
- description
- contentSize (MB, KB, or B)
- encodingFormat (from PRONOM: https://www.nationalarchives.gov.uk/PRONOM)

Contextual 

- encodingFormat
    - @id PRONOM url
    - @type Website
    - name
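
The `contentSize` value, and the `sha256` attribute listed for File entities, can be computed from the file instead of being typed by hand. A minimal sketch with the standard library follows; `describe_file` and `human_size` are hypothetical helpers, and the use of decimal units (1 MB = 1,000,000 bytes) is an assumption.

```python
import hashlib
from pathlib import Path


def human_size(num_bytes):
    """Format a byte count as B, KB, or MB using decimal units (an assumption)."""
    if num_bytes >= 1_000_000:
        return f"{num_bytes / 1_000_000:.2f}MB"
    if num_bytes >= 1_000:
        return f"{num_bytes / 1_000:.2f}KB"
    return f"{num_bytes}B"


def describe_file(path, chunk_size=1 << 20):
    """Return contentSize and sha256 for a file, reading it in chunks."""
    path = Path(path)
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large files are never loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "contentSize": human_size(path.stat().st_size),
        "sha256": digest.hexdigest(),
    }
```

The returned dict could be merged into the `properties` passed to a File entity, for example `properties={**describe_file(local_path), ...}`, where `local_path` is whatever local copy of the file is available.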

In [8]:
encoding_format = crate.add(
    ContextEntity(
        crate,
        "https://www.nationalarchives.gov.uk/PRONOM/fmt/1367",
        properties={
            "@type": "Website",
            "name": "application/geo+json",
        },
    )
)

In [9]:
# File inside the dataset directory
raw_file = crate.add(
    File(
        crate,
        source=f"file:///data/extract/{short_name}/OpenDataPhillyCompleteStreets.geojson",
        properties={
            "contentSize": "34.59MB",
            "encodingFormat": [encoding_format.properties()["name"], {"@id": encoding_format.id}],
            "sameAs": "https://opendata.arcgis.com/datasets/ed90e9016aab4c429cb7dd8aef2a87a3_0.geojson",
        }
    )
)

raw_dataset = crate.add(
    Dataset(
        crate,
        source=f"file:///data/extract/{short_name}",
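        properties={
            "description": "Complete Centerline Geometries and Descriptions of Streets in Philadelphia",
            "sameAs": "https://metadata.phila.gov/#home/datasetdetails/5543867320583086178c4f34/",
            # schema.org names this property "hasPart"; the "hasParts" key is kept
            # because the lookup in cell 16 reads it under that name.
            "hasParts": [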
                {"@id": raw_file.id},
            ]
        }
    )
)

In [10]:
# Main entity of the crate
crate.mainEntity = raw_dataset

In [11]:
crate.mainEntity.source

'file:///data/extract/OpenDataPhillyStreets'

### Store Research Object

In [12]:
crate.write(f"../../metastore/{short_name}/")
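
The written directory contains a `ro-crate-metadata.json` file whose entities sit in a JSON-LD `@graph` array, so it can also be inspected without the library. The sketch below assumes that layout; `load_graph` is a hypothetical helper.

```python
import json
from pathlib import Path


def load_graph(crate_dir):
    """Index the entities in ro-crate-metadata.json by their @id."""
    metadata_path = Path(crate_dir) / "ro-crate-metadata.json"
    with metadata_path.open(encoding="utf-8") as handle:
        document = json.load(handle)
    # Each entity in the flattened graph carries a unique @id.
    return {entity["@id"]: entity for entity in document["@graph"]}
```

For example, `load_graph(f"../../metastore/{short_name}/")["./"]` would return the root dataset entity, which is a quick way to check that `name`, `license`, and `mainEntity` were stored as intended.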

## Confirm Usage of Crate

In [13]:
read_crate = ROCrate(f"../../metastore/{short_name}/")

In [14]:
for e in read_crate.get_entities():
    print(e.id, e.type)

./ Dataset
ro-crate-metadata.json CreativeWork
file:///data/extract/OpenDataPhillyStreets/OpenDataPhillyCompleteStreets.geojson File
file:///data/extract/OpenDataPhillyStreets/ Dataset
#Department of Planning and Development Organization
#Philadelphia Data Terms Of Use License
https://www.nationalarchives.gov.uk/PRONOM/fmt/1367 Website


In [15]:
main_id = read_crate.mainEntity.id
main_id

'file:///data/extract/OpenDataPhillyStreets/'

In [16]:
file_id = read_crate.get(main_id).properties().get("hasParts")[0]["@id"]
file_id

'file:///data/extract/OpenDataPhillyStreets/OpenDataPhillyCompleteStreets.geojson'

In [17]:
read_crate.get(file_id).source

'file:///data/extract/OpenDataPhillyStreets/OpenDataPhillyCompleteStreets.geojson'
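
Because `source` is a `file://` URI, downstream code has to turn it into a filesystem path before opening the data. A minimal sketch, assuming a POSIX filesystem and a URI without a host part; `file_uri_to_path` is a hypothetical helper.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def file_uri_to_path(uri):
    """Convert a file:// URI into a POSIX path (assumes no host component)."""
    parsed = urlparse(uri)
    if parsed.scheme != "file":
        raise ValueError(f"not a file URI: {uri}")
    # Percent-encoded characters such as %20 are decoded back to plain text.
    return PurePosixPath(unquote(parsed.path))


uri = "file:///data/extract/OpenDataPhillyStreets/OpenDataPhillyCompleteStreets.geojson"
print(file_uri_to_path(uri))
```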