# Data Ingestion of US Census Block Data

Add the raw data to the project and manually create metadata for the upstream source.

Each dataset gets its own research object so that it can be tracked individually, without creating one very large, interconnected research object for the whole project.

## Dataset Descriptions

Use the following attributes and entities as needed:

- Attributes
    - name
    - publisher {"@id": "#local_id"} or {"@id": "url"}
    - creator {"@id": "#person"}
    - license
    - datePublished
    - keywords (ex. "streets, aggregated features, map")
    - mainEntity Output Dataset 
- Data Entities
    - File
        - source (File(s))
        - description
        - contentSize (MB, KB, or B)
        - encodingFormat (from PRONOM: https://www.nationalarchives.gov.uk/PRONOM)
        - sha256
    - Dataset
        - source (File(s))
        - description
        - contentSize (MB, KB, or B)
        - encodingFormat (from PRONOM: https://www.nationalarchives.gov.uk/PRONOM)
    - ComputationalWorkflow
        - @type (pre-defined with built-in models) ["File", "SoftwareSourceCode", "ComputationalWorkflow"]
        - source (File)
        - author
        - programmingLanguage
        - input [{"@id": "#id1"}, {"@id": "#id2"}]
        - output [{"@id": "#id3"}]
- Contextual Entities
    - Publisher
        - @type Organization
        - name
        - url
    - Person
        - name
    - programmingLanguage
        - @type ["ComputerLanguage", "SoftwareApplication"]
        - name
        - version
        - url
    - encodingFormat
        - @id PRONOM url
        - @type Website
        - name
    - input
        - @type FormalParameter
        - name
        - defaultValue {"@id": "#ReferenceToUpstreamDataset"}
        - encodingFormat
        - valueRequired True
    - output
        - @type FormalParameter
        - name
        - defaultValue {"@id": "#ReferenceToDownstreamDataset"}
        - encodingFormat
        - valueRequired True
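
The notebook below only ingests raw data, so the ComputationalWorkflow and FormalParameter entries in the list above are never exercised. As a rough sketch of how they could fit together, the following builds the entities as plain JSON-LD dictionaries using only the standard library. The workflow path, parameter names, and the `#python` identifier are hypothetical and chosen purely for illustration; `#id1` and `#id3` are the placeholder ids from the list.

```python
import json

# Hypothetical references to the upstream and downstream datasets.
upstream_dataset = {"@id": "#ReferenceToUpstreamDataset"}
downstream_dataset = {"@id": "#ReferenceToDownstreamDataset"}


def formal_parameter(local_id, name, default_value, encoding_format):
    """Build a FormalParameter contextual entity as a plain JSON-LD dict."""
    return {
        "@id": local_id,
        "@type": "FormalParameter",
        "name": name,
        "defaultValue": default_value,
        "encodingFormat": encoding_format,
        "valueRequired": True,
    }


# Parameter names are assumptions, not taken from the project.
input_param = formal_parameter("#id1", "raw_census_csv", upstream_dataset, "text/csv")
output_param = formal_parameter("#id3", "cleaned_census_csv", downstream_dataset, "text/csv")

workflow = {
    "@id": "workflows/clean_census.py",  # hypothetical script path
    "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
    "name": "Clean census data",  # hypothetical name
    "programmingLanguage": {"@id": "#python"},  # hypothetical contextual entity
    # The workflow points at its parameters by id only.
    "input": [{"@id": input_param["@id"]}],
    "output": [{"@id": output_param["@id"]}],
}

graph = [workflow, input_param, output_param]
print(json.dumps(graph, indent=2))
```

The workflow refers to its parameters by `@id`, and each parameter in turn refers to a dataset through `defaultValue`, which is what links a workflow to its upstream and downstream data.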

In [1]:
from datetime import date
from rocrate.rocrate import ROCrate
from rocrate.model import (
    Person,
    File,
    Dataset,
    ComputationalWorkflow,
    ContextEntity,
)

In [2]:
crate = ROCrate()

### Main Attributes

- name
- publisher {"@id": "#local_id"} or {"@id": "url"}
- creator {"@id": "#person"}
- license
- datePublished
- keywords (ex. "streets, aggregated features, map")
- mainEntity Output Dataset

Contextual

- Publisher
    - @type Organization
    - name
    - url

In [3]:
name = "US 2020 Census Zip Population Data for Pennsylvania"
short_name = "US2020CensusZCTA5PA"

In [4]:
us_census_bureau = crate.add(
    ContextEntity(
        crate,
        "U.S. Census Bureau",
        properties={
            "@type": "Organization",
            "name": "U.S. Census Bureau",
            "url": "https://www.census.gov/"
        },
    )
)

In [5]:
us_public_domain = crate.add(
    ContextEntity(
        crate,
        "U.S. Public Domain",
        properties={
            "@type": "License",
            "name": "U.S. Public Domain",
            "url": "http://www.usa.gov/publicdomain/label/1.0/",
        },
    )
)

In [6]:
# Date Retrieved from US Census Data Portal
date_published = date(2024, 3, 17).isoformat()

In [7]:
crate.name = name
crate.publisher = us_census_bureau
crate.creator = us_census_bureau
crate.license = us_public_domain
crate.datePublished = date_published
crate.keywords = ["US Census", "US Census Block", "Population", "Ingested Data", "Raw"]

### Dataset

- source (File(s))
- description
- contentSize (MB, KB, or B)
- encodingFormat (from PRONOM: https://www.nationalarchives.gov.uk/PRONOM)

Contextual 

- encodingFormat
    - @id PRONOM url
    - @type Website
    - name

In [8]:
encoding_format = crate.add(
    ContextEntity(
        crate,
        "https://www.nationalarchives.gov.uk/PRONOM/x-fmt/18",
        properties={
            "@type": "Website",
            "name": "text/csv",
        },
    )
)

In [9]:
# Dataset
raw_file = crate.add(
    File(
        crate,
        source=f"file:///data/extract/{short_name}/USCensusDecennialDP2020DP1.csv",
        properties={
            "description": "U.S. 2020 Decennial Census for the Survey and Populations and People Zip Data for Pennsylvania",
            "contentSize": "3.48MB",
            "encodingFormat": [encoding_format.properties()["name"], {"@id": encoding_format.id}],
        }
    )
)

raw_dataset = crate.add(
    Dataset(
        crate,
        source=f"file:///data/extract/{short_name}",
        properties={
            "description": "U.S. 2020 Decennial Census for the Survey and Populations and People Zip Data for Pennsylvania",
            "hasParts": [
                {"@id": raw_file.id},
            ]
        }
    )
)
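
The cell above writes `contentSize` as a literal string and omits the `sha256` attribute listed for File entities. A minimal sketch of how both values could be derived from the file itself, using only the standard library, might look like the following. The helper names `human_size` and `describe_file` are hypothetical, and the use of decimal units (1 KB = 1000 B) is an assumption, since the notebook does not say which convention it follows.

```python
import hashlib
from pathlib import Path


def human_size(num_bytes):
    """Format a byte count as B, KB, or MB (assumes decimal units: 1 KB = 1000 B)."""
    if num_bytes < 1000:
        return f"{num_bytes}B"
    if num_bytes < 1000**2:
        return f"{num_bytes / 1000:.2f}KB"
    return f"{num_bytes / 1000**2:.2f}MB"


def describe_file(path, chunk_size=1 << 20):
    """Return contentSize and sha256 for a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large files are never loaded whole.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "contentSize": human_size(Path(path).stat().st_size),
        "sha256": digest.hexdigest(),
    }


print(human_size(3_480_000))  # 3.48MB
```

The dictionary returned by `describe_file` could then be merged into the `properties` argument of the File entity, so the recorded size and checksum always reflect the file on disk.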

In [10]:
# Main Entity
crate.mainEntity = raw_dataset

### Store Research Object

In [11]:
crate.write(f"../../metastore/{short_name}/")

## Confirm Usage of Crate

In [12]:
read_crate = ROCrate(f"../../metastore/{short_name}/")

In [13]:
for e in read_crate.get_entities():
    print(e.id, e.type)

./ Dataset
ro-crate-metadata.json CreativeWork
file:///data/extract/US2020CensusZCTA5PA/USCensusDecennialDP2020DP1.csv File
file:///data/extract/US2020CensusZCTA5PA/ Dataset
#U.S. Census Bureau Organization
#U.S. Public Domain License
https://www.nationalarchives.gov.uk/PRONOM/x-fmt/18 Website
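
As a cross-check that does not depend on the `rocrate` package, the same listing can be approximated by reading `ro-crate-metadata.json` directly, since an RO-Crate stores its entities in a top-level `@graph` array. The sketch below assumes that layout; the function name `list_entities` is hypothetical.

```python
import json
from pathlib import Path


def list_entities(crate_dir):
    """Return (id, type) pairs for every entity in a crate's metadata file."""
    metadata_path = Path(crate_dir) / "ro-crate-metadata.json"
    metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
    # Each entry in "@graph" is one entity; "@type" may be a string or a list.
    return [(entity["@id"], entity.get("@type")) for entity in metadata["@graph"]]
```

Calling `list_entities` on the directory passed to `crate.write` should give a listing comparable to the output above, which helps confirm that the stored file is readable by tools other than the library that wrote it.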


In [14]:
read_crate.mainEntity.id

'file:///data/extract/US2020CensusZCTA5PA/'

In [15]:
file_id = read_crate.get(read_crate.mainEntity.id).properties()["hasParts"][0]["@id"]
file_id

'file:///data/extract/US2020CensusZCTA5PA/USCensusDecennialDP2020DP1.csv'

In [16]:
read_crate.get(file_id).properties()

{'@id': 'file:///data/extract/US2020CensusZCTA5PA/USCensusDecennialDP2020DP1.csv',
 '@type': 'File',
 'contentSize': '3.48MB',
 'description': 'U.S. 2020 Decennial Census for the Survey and Populations and People Zip Data for Pennsylvania',
 'encodingFormat': ['text/csv',
  {'@id': 'https://www.nationalarchives.gov.uk/PRONOM/x-fmt/18'}]}

In [17]:
read_crate.get(file_id).__class__

rocrate.model.file.File

In [18]:
read_crate.get(file_id).source

'file:///data/extract/US2020CensusZCTA5PA/USCensusDecennialDP2020DP1.csv'