Add Socrata Data Nodes #306

Open
wants to merge 2 commits into
base: main
from

Conversation

UrbanGISer
Contributor

Socrata Search: searches the Socrata database and returns the list of matching datasets.
Socrata Data Query: retrieves the data for a given domain and dataset ID.

@UrbanGISer UrbanGISer added the new node Special enhancement which has a new KNIME node as outcome label Nov 14, 2023
@UrbanGISer UrbanGISer added this to the Release 1.3 milestone Nov 14, 2023
@UrbanGISer UrbanGISer linked an issue Nov 14, 2023 that may be closed by this pull request
@koettert koettert force-pushed the 305-add-socrata-data-to-open-dataset-nodes branch from 3cb837e to 9901253 Compare January 9, 2024 14:43
description="Socrata dataset based on search keywords",
)
class SocrataSearchNode:
"""Retrive the open data category via Socrata API.
Contributor

Retrieve. Please search for other occurrences and fix them as well.

Contributor Author

revised


query_item = self.queryitem
request = Request(
f"http://api.us.socrata.com/api/catalog/v1?q={query_item}&only=datasets&limit=10000"
Contributor

Needs URL encoding, e.g. entering two search strings separated by a space throws an exception.

Contributor Author

encoded_query_item = quote(query_item)
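As an alternative to quoting only the search term, the whole query string could be built with `urlencode`, which escapes every parameter in one step. This is only a sketch: the helper name `build_catalog_url` is hypothetical, and the base URL and parameters are copied from the line under review.

```python
from urllib.parse import urlencode

# Base URL taken from the request in the diff above.
CATALOG_URL = "http://api.us.socrata.com/api/catalog/v1"


def build_catalog_url(query_item: str, limit: int = 10000) -> str:
    """Return the catalog search URL with all parameters percent-encoded.

    Hypothetical helper, not part of the PR.
    """
    params = {"q": query_item, "only": "datasets", "limit": limit}
    # urlencode escapes spaces, ampersands and other reserved characters.
    return f"{CATALOG_URL}?{urlencode(params)}"
```

With this, a search term containing a space or an ampersand no longer breaks the request URL.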

# Create a DataFrame from the dataset information, and flatten the nested dictionaries
df = json_normalize(dataset_info)
df = df.drop(
columns=["classification.domain_tags", "classification.domain_metadata"]
Contributor

Make this code more resilient, since the columns are not always present. For example, searching for all_utah_fire_data_long_lat_2018_carto makes the node throw this exception: Execute failed: "['classification.domain_tags', 'classification.domain_metadata'] not found in axis"

Contributor Author

columns_to_drop = ["classification.domain_tags", "classification.domain_metadata"]
columns_to_drop = [col for col in columns_to_drop if col in df.columns]
df = df.drop(columns=columns_to_drop)
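The same effect can probably be had without the list comprehension, because `DataFrame.drop` accepts `errors="ignore"` and then skips labels that are missing. A minimal sketch, where the helper name `drop_optional_columns` and the sample record in the usage note are made up for illustration:

```python
import pandas as pd

# Column names taken from the snippet above.
OPTIONAL_COLUMNS = ["classification.domain_tags", "classification.domain_metadata"]


def drop_optional_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the optional classification columns if they exist.

    Hypothetical helper; errors="ignore" makes drop() skip missing labels
    instead of raising KeyError.
    """
    return df.drop(columns=OPTIONAL_COLUMNS, errors="ignore")
```

For instance, `drop_optional_columns(pd.json_normalize([{"resource": {"id": "abcd-1234"}}]))` returns the frame unchanged instead of raising, since neither classification column is present.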


# First 2000 results, returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get(self.resource_id, limit=100000)
Contributor

What does the node do if the data has more than 100k rows? Can we use paging to loop through the whole result with progress and cancellation support?

Contributor Author

This is only for querying the dataset list, not the data, so it might not be necessary. The documentation below mentions that version 2.1 allows an unlimited number of results, but I haven't found a way to use API 2.1:
https://dev.socrata.com/docs/paging

Contributor Author

I added paging here:
client = Socrata(self.metadata_domain, None)
limit = 100000
offset = 0
all_results = []
while True:
    results = client.get(self.resource_id, limit=limit, offset=offset)
    if not results:
        break
    all_results.extend(results)
    offset += limit
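The review also asked for progress and cancellation support, which the loop above does not cover yet. One way to sketch it is to keep the paging logic independent of sodapy and KNIME and pass in three callables. All names below (`fetch_all_rows`, `fetch_page`, `is_canceled`, `on_progress`) are hypothetical, and wiring the two callbacks to the node's execution context is an assumption that is not shown here.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]


def fetch_all_rows(
    fetch_page: Callable[[int, int], List[Row]],
    page_size: int = 1000,
    is_canceled: Optional[Callable[[], bool]] = None,
    on_progress: Optional[Callable[[int], None]] = None,
) -> List[Row]:
    """Collect every row by calling fetch_page(limit, offset) repeatedly."""
    all_rows: List[Row] = []
    offset = 0
    while True:
        # Check for cancellation before each request so a long download
        # can be aborted between pages.
        if is_canceled is not None and is_canceled():
            raise RuntimeError("Download canceled")
        page = fetch_page(page_size, offset)
        all_rows.extend(page)
        if on_progress is not None:
            on_progress(len(all_rows))  # report the number of rows so far
        # A short (or empty) page means the last page has been reached.
        if len(page) < page_size:
            break
        offset += page_size
    return all_rows
```

With sodapy, the page fetcher could be something like `lambda limit, offset: client.get(resource_id, limit=limit, offset=offset)`. Stopping on a short page also saves the final empty request that the `if not results` check needs.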

name="Socrata dataset list",
description="Socrata dataset based on search keywords",
)
class SocrataSearchNode:
Contributor

Can you please rewrite the node description to first mention what kind of data can be retrieved, rather than leading with the technology used, which won't interest most users?

Contributor Author

Lists Socrata datasets matching the query term, drawn from a wealth of open data resources published by governments, non-profits, and NGOs around the world.

description="Socrata dataset based on search keywords",
)
class SocrataDataNode:
"""Retrive the open data category via Socrata API.
Contributor

Can you please rewrite the node description to first mention what kind of data can be retrieved, rather than leading with the technology used, which won't interest most users?

Contributor Author

Access open datasets from various well-known data resources and organizations effortlessly using the SODA interface.

US Centers for Disease Control and Prevention (CDC): CDC data includes information on infectious diseases, chronic conditions, environmental health hazards, injury prevention, maternal and child health, immunization coverage, and much more. These datasets are collected through surveillance systems, population surveys, epidemiological studies, and collaborative research efforts conducted by the CDC and its partners.

Data.gov: The official open data platform of the United States government, offering datasets from various U.S. government agencies covering fields such as education, healthcare, transportation, and the environment.

Chicago Data Portal: The open data platform provided by the City of Chicago, offering datasets related to the city, including crime data, transportation data, demographic statistics, and more.

NYC Open Data: The open data platform provided by the City of New York, offering datasets covering urban planning, public transportation, health, and various other aspects of the city.

UK Government Data Service: The open data platform provided by the UK government, offering datasets from various governmental bodies covering economics, social issues, the environment, and more.

World Bank Data: The open data platform provided by the World Bank, offering a wide range of economic, social, and environmental datasets from around the world for research and analysis of global development trends.

@koettert koettert assigned UrbanGISer and unassigned koettert Jan 9, 2024

queryitem = knext.StringParameter(
label="Input searching item",
description="""Enter search keywords or dataset names to find relevant datasets in the Socrata database.
Contributor

Add a more comprehensive description of what is possible here, e.g. complex queries.

Contributor Author

https://dev.socrata.com/docs/filtering
For querying the dataset list, this might not be needed. I will add some of this functionality to the dataset query instead.
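For reference, a filtered dataset query of the kind described on the linked page could be assembled roughly as follows. This is a sketch, not the node's implementation: the helper name `build_soda_query_url`, the domain and the resource ID are placeholders, and the `$where`, `$select`, `$limit` and `$offset` parameters are the SoQL parameters from the Socrata documentation.

```python
from typing import Optional
from urllib.parse import urlencode


def build_soda_query_url(
    domain: str,
    resource_id: str,
    where: Optional[str] = None,
    select: Optional[str] = None,
    limit: int = 1000,
    offset: int = 0,
) -> str:
    """Build a SODA resource URL with optional SoQL filter parameters.

    Hypothetical helper; domain and resource_id are supplied by the caller.
    """
    params = {"$limit": limit, "$offset": offset}
    if where:
        params["$where"] = where
    if select:
        params["$select"] = select
    # safe="$" keeps the leading "$" of the SoQL parameter names readable.
    query = urlencode(params, safe="$")
    return f"https://{domain}/resource/{resource_id}.json?{query}"
```

A call such as `build_soda_query_url("data.example.org", "abcd-1234", where="magnitude > 3.0", limit=50)` would then request only the matching rows, with the filter expression percent-encoded.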

Contributor

@koettert koettert left a comment

Can we come up with a better name for the nodes since Socrata

@UrbanGISer UrbanGISer assigned koettert and unassigned UrbanGISer Mar 5, 2024
Labels
new node Special enhancement which has a new KNIME node as outcome
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add socrata data to Open Dataset nodes
2 participants