# Data Ingestion of US Census ZIP Shape Data

Add the raw data to the project and manually create metadata describing the upstream source.

Each dataset gets its own research object so that it can be tracked individually, rather than folding everything into one very large, interconnected research object for the whole project.

## Dataset Descriptions

Use the following attributes and entities as needed.

- Attributes
    - name
    - publisher {"@id": "#local_id"} or {"@id": "url"}
    - creator {"@id": "#person"}
    - license
    - datePublished
    - keywords (ex. "streets, aggregated features, map")
    - mainEntity Output Dataset 
- Data Entities
    - File
        - source (File(s))
        - description
        - contentSize (MB, KB, or B)
        - encodingFormat (from PRONOM: https://www.nationalarchives.gov.uk/PRONOM)
        - sha256
    - Dataset
        - source (File(s))
        - description
        - contentSize (MB, KB, or B)
        - encodingFormat (from PRONOM: https://www.nationalarchives.gov.uk/PRONOM)
    - ComputationalWorkflow
        - @type (pre-defined with built-in models) ["File", "SoftwareSourceCode", "ComputationalWorkflow"]
        - source (File)
        - author
        - programmingLanguage
        - input [{"@id": "#id1"}, {"@id": "#id2"}]
        - output [{"@id": "#id3"}]
- Contextual Entities
    - Publisher
        - @type Organization
        - name
        - url
    - Person
        - name
    - programmingLanguage
        - @type ["ComputerLanguage", "SoftwareApplication"]
        - name
        - version
        - url
    - encodingFormat
        - @id PRONOM url
        - @type Website
        - name
    - input
        - @type FormalParameter
        - name
        - defaultValue {"@id": "#ReferenceToUpstreamDataset"}
        - encodingFormat
        - valueRequired True
    - output
        - @type FormalParameter
        - name
        - defaultValue {"@id": "#ReferenceToDownstreamDataset"}
        - encodingFormat
        - valueRequired True
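
To make the `{"@id": ...}` notation above concrete, here is a minimal sketch that builds one contextual entity and one data entity as plain Python dictionaries and prints them as JSON-LD. It uses only the standard library. The `#census` identifier and the `"Example dataset"` name are illustrative assumptions and are not part of the crate built below.

```python
import json

# Contextual entity: a publisher identified by a local "#" id (hypothetical id).
publisher = {
    "@id": "#census",
    "@type": "Organization",
    "name": "U.S. Census Bureau",
    "url": "https://www.census.gov/",
}

# Data entity: points at the publisher with {"@id": ...} instead of nesting it.
root = {
    "@id": "./",
    "@type": "Dataset",
    "name": "Example dataset",  # placeholder name for illustration
    "publisher": {"@id": publisher["@id"]},
    "keywords": "streets, aggregated features, map",
}

graph = [root, publisher]
print(json.dumps({"@graph": graph}, indent=2))
```

Keeping entities flat and linking them by `@id` is what lets several datasets share the same publisher or encoding format entity.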

In [1]:
from datetime import date
from rocrate.rocrate import ROCrate
from rocrate.model import (
    Person,
    File,
    Dataset,
    ComputationalWorkflow,
    ContextEntity,
)

In [2]:
crate = ROCrate()

### Main Attributes

- name
- publisher {"@id": "#local_id"} or {"@id": "url"}
- creator {"@id": "#person"}
- license
- datePublished
- keywords (ex. "streets, aggregated features, map")
- mainEntity Output Dataset

Contextual

- Publisher
    - @type Organization
    - name
    - url

In [3]:
name = "US 2020 Census ZCTA5 Geometries"
short_name = "US2020CensusZCTA5Geometry"

In [4]:
us_census_bureau = crate.add(
    ContextEntity(
        crate,
        "U.S. Census Bureau",
        properties={
            "@type": "Organization",
            "name": "U.S. Census Bureau",
            "url": "https://www.census.gov/"
        },
    )
)

In [5]:
us_public_domain = crate.add(
    ContextEntity(
        crate,
        "U.S. Public Domain",
        properties={
            "@type": "License",
            "name": "U.S. Public Domain",
            "url": "http://www.usa.gov/publicdomain/label/1.0/",
        },
    )
)

In [6]:
date_published = date(2021, 2, 2).isoformat()

In [7]:
crate.name = name
crate.publisher = us_census_bureau
crate.creator = us_census_bureau
crate.license = us_public_domain
crate.datePublished = date_published
crate.keywords = ["US Census", "US Census ZCTA5", "Geometry", "Ingested Data", "Raw"]

### Dataset

- source (File(s))
- description
- contentSize (MB, KB, or B)
- encodingFormat (from PRONOM: https://www.nationalarchives.gov.uk/PRONOM)

Contextual 

- encodingFormat
    - @id PRONOM url
    - @type Website
    - name
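
The `contentSize` values in this notebook are entered by hand. As a rough sketch of how `contentSize`, and the `sha256` attribute listed for File entities, could instead be derived from the files themselves, the helpers below use only the standard library. The function names `format_size` and `file_properties`, the decimal unit thresholds, and the rounding are assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def format_size(num_bytes: int) -> str:
    """Render a byte count as B, KB, or MB (decimal units are an assumption)."""
    if num_bytes >= 1_000_000:
        return f"{num_bytes / 1_000_000:.2f}MB"
    if num_bytes >= 1_000:
        return f"{num_bytes / 1_000:.0f}KB"
    return f"{num_bytes}B"


def file_properties(path: Path, chunk_size: int = 1 << 20) -> dict:
    """Return contentSize and sha256 for a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Chunked reads keep memory flat even for very large .shp files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "contentSize": format_size(path.stat().st_size),
        "sha256": digest.hexdigest(),
    }


# Demonstrate on a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"abc")
    props = file_properties(sample)
    print(props)
```

The returned dictionary could be merged into the `properties` argument of a `File` entity.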

In [8]:
encoding_format_shp = crate.add(
    ContextEntity(
        crate,
        "https://www.nationalarchives.gov.uk/PRONOM/x-fmt/235",
        properties={
            "@type": "Website",
            "name": "ESRI Arc/View ShapeFile",
        },
    )
)

In [9]:
encoding_format_shx = crate.add(
    ContextEntity(
        crate,
        "https://www.nationalarchives.gov.uk/PRONOM/fmt/277",
        properties={
            "@type": "Website",
            "name": "ESRI Arc/View Shapefile Index",
        },
    )
)

In [10]:
encoding_format_prj = crate.add(
    ContextEntity(
        crate,
        "https://www.nationalarchives.gov.uk/PRONOM/fmt/320",
        properties={
            "@type": "Website",
            "name": "ESRI Shapefile Projection (Well-Known Text) Format",
        },
    )
)

In [11]:
encoding_format_dbf = crate.add(
    ContextEntity(
        crate,
        "https://www.nationalarchives.gov.uk/PRONOM/x-fmt/271",
        properties={
            "@type": "Website",
            "name": "dBASE Database",
        },
    )
)

In [12]:
encoding_format_cpg = crate.add(
    ContextEntity(
        crate,
        "https://www.nationalarchives.gov.uk/PRONOM/fmt/1253",
        properties={
            "@type": "Website",
            "name": "ESRI Code Page File",
        },
    )
)
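
The five cells above differ only in the PRONOM identifier and the format name. One possible way to cut the repetition, sketched here with plain dictionaries and no dependency on `rocrate`, is to drive the entities from a small lookup table. The names `SHAPEFILE_FORMATS` and `encoding_format_entity` are hypothetical. The identifiers and format names are copied from the cells above.

```python
PRONOM_BASE = "https://www.nationalarchives.gov.uk/PRONOM/"

# File extension -> (PRONOM identifier, format name), as used in this notebook.
SHAPEFILE_FORMATS = {
    "shp": ("x-fmt/235", "ESRI Arc/View ShapeFile"),
    "shx": ("fmt/277", "ESRI Arc/View Shapefile Index"),
    "prj": ("fmt/320", "ESRI Shapefile Projection (Well-Known Text) Format"),
    "dbf": ("x-fmt/271", "dBASE Database"),
    "cpg": ("fmt/1253", "ESRI Code Page File"),
}


def encoding_format_entity(extension: str) -> dict:
    """Build the JSON-LD dictionary for one encodingFormat contextual entity."""
    puid, name = SHAPEFILE_FORMATS[extension]
    return {"@id": PRONOM_BASE + puid, "@type": "Website", "name": name}


entities = {ext: encoding_format_entity(ext) for ext in SHAPEFILE_FORMATS}
for ext, entity in entities.items():
    print(ext, entity["@id"])
```

Each dictionary's `@id` and remaining keys correspond to the identifier and `properties` arguments passed to `ContextEntity` in the cells above.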

In [13]:
# One File entity per shapefile component, then the Dataset that groups them
raw_file_shp = crate.add(
    File(
        crate,
        source=f"file:///data/extract/{short_name}/{short_name}.shp",
        properties={
            "contentSize": "800MB",
            "encodingFormat": [encoding_format_shp.properties()["name"], {"@id": encoding_format_shp.id}],
        }
    )
)

raw_file_shx = crate.add(
    File(
        crate,
        source=f"file:///data/extract/{short_name}/{short_name}.shx",
        properties={
            "contentSize": "265KB",
            "encodingFormat": [encoding_format_shx.properties()["name"], {"@id": encoding_format_shx.id}],
        }
    )
)

raw_file_dbf = crate.add(
    File(
        crate,
        source=f"file:///data/extract/{short_name}/{short_name}.dbf",
        properties={
            "contentSize": "2.31MB",
            "encodingFormat": [encoding_format_dbf.properties()["name"], {"@id": encoding_format_dbf.id}],
        }
    )
)

raw_file_cpg = crate.add(
    File(
        crate,
        source=f"file:///data/extract/{short_name}/{short_name}.cpg",
        properties={
            "contentSize": "1KB",
            "encodingFormat": [encoding_format_cpg.properties()["name"], {"@id": encoding_format_cpg.id}],
        }
    )
)

raw_file_prj = crate.add(
    File(
        crate,
        source=f"file:///data/extract/{short_name}/{short_name}.prj",
        properties={
            "contentSize": "1KB",
            "encodingFormat": [encoding_format_prj.properties()["name"], {"@id": encoding_format_prj.id}],
        }
    )
)

raw_dataset = crate.add(
    Dataset(
        crate,
        source=f"file:///data/extract/{short_name}",
        properties={
            "description": "U.S. 2020 Decennial Census ZIP Geometries",
            "sameAs": "https://www.census.gov/geographies/mapping-files/2020/geo/tiger-line-file.html",
            "hasParts": [
                {"@id": raw_file_prj.id},
                {"@id": raw_file_cpg.id},
                {"@id": raw_file_shx.id},
                {"@id": raw_file_shp.id},
                {"@id": raw_file_dbf.id},
            ]
        }
    )
)
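
The file identifiers and the part references in the cell above all follow the same pattern. As a small sketch of that pattern, the snippet below derives both from `short_name` and a list of extensions, using plain Python only. The variable names `base`, `extensions`, `file_ids`, and `part_refs` are illustrative.

```python
short_name = "US2020CensusZCTA5Geometry"
base = f"file:///data/extract/{short_name}"

# Same order as the part list in the cell above.
extensions = ["prj", "cpg", "shx", "shp", "dbf"]

# Each component file shares the dataset's short name and differs only by extension.
file_ids = [f"{base}/{short_name}.{ext}" for ext in extensions]

# The Dataset refers to its files by {"@id": ...}, never by embedding them.
part_refs = [{"@id": file_id} for file_id in file_ids]

for ref in part_refs:
    print(ref["@id"])
```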

In [14]:
# Set the raw dataset as the crate's main entity
crate.mainEntity = raw_dataset

In [15]:
crate.mainEntity.source

'file:///data/extract/US2020CensusZCTA5Geometry'

### Store Research Object

In [16]:
crate.write(f"../../metastore/{short_name}/")

## Confirm Usage of Crate

In [17]:
read_crate = ROCrate(f"../../metastore/{short_name}/")

In [18]:
for e in read_crate.get_entities():
    print(e.id, e.type)

./ Dataset
ro-crate-metadata.json CreativeWork
file:///data/extract/US2020CensusZCTA5Geometry/US2020CensusZCTA5Geometry.shp File
file:///data/extract/US2020CensusZCTA5Geometry/US2020CensusZCTA5Geometry.shx File
file:///data/extract/US2020CensusZCTA5Geometry/US2020CensusZCTA5Geometry.dbf File
file:///data/extract/US2020CensusZCTA5Geometry/US2020CensusZCTA5Geometry.cpg File
file:///data/extract/US2020CensusZCTA5Geometry/US2020CensusZCTA5Geometry.prj File
file:///data/extract/US2020CensusZCTA5Geometry/ Dataset
#U.S. Census Bureau Organization
#U.S. Public Domain License
https://www.nationalarchives.gov.uk/PRONOM/x-fmt/235 Website
https://www.nationalarchives.gov.uk/PRONOM/fmt/277 Website
https://www.nationalarchives.gov.uk/PRONOM/fmt/320 Website
https://www.nationalarchives.gov.uk/PRONOM/x-fmt/271 Website
https://www.nationalarchives.gov.uk/PRONOM/fmt/1253 Website


In [19]:
read_crate.mainEntity.id

'file:///data/extract/US2020CensusZCTA5Geometry/'

In [20]:
read_crate.get(read_crate.mainEntity.id).source

'file:///data/extract/US2020CensusZCTA5Geometry/'

In [21]:
read_crate.get(
    read_crate.get(read_crate.mainEntity.id)
    .properties().get("hasParts", [])[0]["@id"]
).source

'file:///data/extract/US2020CensusZCTA5Geometry/US2020CensusZCTA5Geometry.prj'
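
Beyond spot checks like the ones above, it can be useful to confirm that every `{"@id": ...}` reference in the metadata points at an entity that actually exists. The sketch below does this on a plain list of JSON-LD dictionaries, such as the `@graph` array loaded from `ro-crate-metadata.json` with `json.load`. The function name `dangling_references` and the sample graph are hypothetical.

```python
def dangling_references(graph: list) -> set:
    """Return every referenced "@id" that has no matching entity in the graph."""
    known = {entity["@id"] for entity in graph}
    referenced = set()

    def collect(value):
        # A reference is a dict whose only key is "@id"; lists may hold several.
        if isinstance(value, dict):
            if set(value) == {"@id"}:
                referenced.add(value["@id"])
            else:
                for nested in value.values():
                    collect(nested)
        elif isinstance(value, list):
            for item in value:
                collect(item)

    for entity in graph:
        for key, value in entity.items():
            if key != "@id":
                collect(value)
    return referenced - known


# Hypothetical graph with two deliberately missing entities.
sample_graph = [
    {"@id": "./", "@type": "Dataset",
     "mainEntity": {"@id": "data/"}, "publisher": {"@id": "#publisher"}},
    {"@id": "data/", "@type": "Dataset",
     "hasPart": [{"@id": "data/a.shp"}, {"@id": "data/a.dbf"}]},
    {"@id": "data/a.shp", "@type": "File"},
]

print(sorted(dangling_references(sample_graph)))
```

On a real crate, references to external resources that are intentionally not described in the graph (for example the `conformsTo` URL on the metadata descriptor) would also be reported, so the result may need filtering.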