# Data Cataloging Basics

## What is a data catalog?
A data catalog is a centralized repository that stores, organizes, and manages metadata about an organization's datasets and data assets. It enables users to easily discover, understand, and access the available data by providing structured information about the datasets, such as their descriptions, formats, sources, usage policies, and relationships with other datasets.

The primary purpose of a data catalog is to help data users, such as analysts, data scientists, and decision-makers, find relevant data quickly and efficiently. A data catalog often includes search and discovery features, allowing users to search for datasets based on keywords, tags, or other criteria. It may also provide data lineage and data profiling information, helping users assess the quality and suitability of the data for their specific use cases.
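To make the idea of searchable metadata concrete, here is a minimal sketch of an in-memory catalog with keyword search. The `CatalogEntry` fields, the `search_catalog` helper, and the two sample entries are hypothetical and chosen only for illustration; a real catalog tool stores far richer metadata and uses a proper search index.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CatalogEntry:
    """Metadata describing one dataset (hypothetical fields for illustration)."""
    name: str
    description: str
    data_format: str
    source: str
    tags: List[str] = field(default_factory=list)


def search_catalog(entries: List[CatalogEntry], keyword: str) -> List[CatalogEntry]:
    """Return entries whose name, description, or tags contain the keyword."""
    needle = keyword.lower()
    return [
        entry for entry in entries
        if needle in entry.name.lower()
        or needle in entry.description.lower()
        or any(needle in tag.lower() for tag in entry.tags)
    ]


# Illustrative entries only; they are not read from any real catalog.
catalog = [
    CatalogEntry("failed-banks", "List of failed banks", "CSV", "FDIC",
                 ["banking", "finance"]),
    CatalogEntry("ev-population", "Registered electric vehicles", "CSV",
                 "State of Washington", ["transportation"]),
]

matches = search_catalog(catalog, "bank")
print([entry.name for entry in matches])  # ['failed-banks']
```

The search is a case-insensitive substring match, which is enough to show how descriptions and tags make a dataset discoverable.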

Data catalogs can be created and maintained manually, but many organizations use data catalog tools or platforms to automate the process of cataloging data and managing metadata. These tools often provide additional features, such as data lineage visualization, data quality management, and integration with data processing and analytics tools.

In summary, a data catalog is a centralized repository for managing metadata about an organization's datasets and data assets, enabling users to discover, understand, and access relevant data more efficiently.

## What is data.gov? Is it based on CKAN.org?

Yes. Data.gov, the U.S. federal government's open data portal, runs its catalog on CKAN. However, there is no single format or metadata standard that all government agencies must adopt for their own data catalogs. The choice depends on the agency's requirements, the types of data it manages, and its interoperability needs.

That said, some metadata standards and guidelines are widely adopted by government agencies, including:

Project Open Data Metadata Schema (DCAT-US): DCAT-US is a profile of the Data Catalog Vocabulary (DCAT), specifically designed for U.S. federal government datasets. It is an extension of the W3C's DCAT standard and includes additional metadata elements tailored to the needs of U.S. federal agencies. Data.gov uses the DCAT-US schema to harvest metadata from various agency data catalogs. More information can be found at https://project-open-data.cio.gov/v1.1/schema/.

Dublin Core: Dublin Core is a simple and widely used metadata standard that is suitable for various types of digital resources, including datasets. The core dataset fields in CKAN, the platform behind Data.gov, such as title, description, author, and license, overlap with Dublin Core elements.

ISO 19115: For geospatial data, many government agencies adopt the ISO 19115 standard, which provides a comprehensive metadata schema for describing geographic information and services.
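As an illustration of the DCAT-US entry above, the sketch below checks a dataset record for the fields that the v1.1 schema marks as always required. The field list is an assumption based on the schema page linked above and should be verified against it (federal agencies have additional required fields such as `bureauCode` and `programCode`). The `missing_required_fields` helper and the sample record are hypothetical.

```python
from typing import Any, Dict, List

# Fields assumed to be always required by DCAT-US v1.1 for a dataset record.
# Verify this list against the schema documentation before relying on it.
REQUIRED_FIELDS = [
    "title",
    "description",
    "keyword",
    "modified",
    "publisher",
    "contactPoint",
    "identifier",
    "accessLevel",
]


def missing_required_fields(record: Dict[str, Any]) -> List[str]:
    """Return the required field names that are absent or empty in the record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# Hypothetical record for illustration; every value here is made up.
record = {
    "title": "Example Dataset",
    "description": "A placeholder dataset used to illustrate the schema.",
    "keyword": ["example"],
    "modified": "2023-01-01",
    "publisher": {"name": "Example Agency"},
    "contactPoint": {"fn": "Jane Doe", "hasEmail": "mailto:jane.doe@example.gov"},
    "identifier": "example-dataset-001",
    # "accessLevel" is deliberately omitted to show the check failing.
}

print(missing_required_fields(record))  # ['accessLevel']
```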

While these metadata standards and guidelines are widely used, each government agency may have its own requirements and preferences for data cataloging. In practice, government agencies often develop their data catalogs based on a combination of existing metadata standards, guidelines, and custom extensions tailored to their specific needs.

In conclusion, there is no one-size-fits-all metadata standard for government agencies' data catalogs. Each agency should choose a format or standard that aligns with their requirements, the types of data they manage, and their interoperability needs with other agencies or data consumers.

## What is CKAN?

CKAN (Comprehensive Knowledge Archive Network) is an open-source data management system that provides a platform for publishing, sharing, and finding datasets. It is designed to help organizations create and maintain data portals, data catalogs, or data hubs that make their datasets easily discoverable, accessible, and usable by both internal and external users.

CKAN is widely used by governments, research institutions, non-profit organizations, and other entities to create public or private data portals. Some of its key features include:

Data organization: CKAN allows publishers to organize datasets into organizations and groups, making it easier for users to discover and navigate related datasets.

Metadata management: CKAN ships with a core set of dataset metadata fields that overlap with Dublin Core elements, and it lets users extend the schema with custom fields, so organizations can add metadata elements specific to their domain or requirements.

Data preview and visualization: CKAN provides built-in data preview functionality, enabling users to view and interact with data directly in the browser without downloading the entire dataset.

Data storage and API access: CKAN can store datasets directly in its file storage system or integrate with external storage solutions. It also provides a RESTful API that enables programmatic access to datasets and metadata.

User management and access control: CKAN supports user authentication, role-based access control, and customizable permission levels, allowing organizations to control access to their datasets and resources.

Extensibility and customization: CKAN is designed to be easily extended and customized through plugins and themes, making it adaptable to a wide range of use cases and requirements.
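To illustrate the API access feature listed above: CKAN exposes its actions over HTTP under `/api/3/action/<action_name>` and wraps each response in a JSON envelope with `success` and `result` keys. The sketch below builds such a URL and unwraps an envelope. The helper names `build_action_url` and `unwrap_result` are hypothetical, and the sample payload is hand-written rather than a real server response.

```python
from typing import Any, Dict
from urllib.parse import urlencode


def build_action_url(base_url: str, action: str, **params: Any) -> str:
    """Build the URL for a CKAN action API call with optional query parameters."""
    url = f"{base_url.rstrip('/')}/api/3/action/{action}"
    if params:
        url += "?" + urlencode(params)
    return url


def unwrap_result(payload: Dict[str, Any]) -> Any:
    """Return the 'result' part of a CKAN response, or raise if the call failed."""
    if not payload.get("success"):
        raise RuntimeError(f"CKAN action failed: {payload.get('error')}")
    return payload["result"]


print(build_action_url("https://catalog.data.gov", "package_search", q="banks", rows=5))
# https://catalog.data.gov/api/3/action/package_search?q=banks&rows=5

# Hand-written payload that mimics the shape of a CKAN response envelope.
sample_payload = {"success": True, "result": {"count": 0, "results": []}}
print(unwrap_result(sample_payload)["count"])  # 0
```

Libraries such as ckanapi hide these details, but knowing the URL shape helps when testing a portal from a browser or with a generic HTTP client.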

In summary, CKAN is an open-source data management system that helps organizations create data portals, catalogs, or hubs for publishing, sharing, and discovering datasets. It offers various features for data organization, metadata management, data preview, API access, and user management, making it a versatile and widely-used solution for data management.

## Let's access data.gov using ckanapi

First, install the ckanapi library if you haven't already. In a Jupyter Notebook cell, run the command below; in a terminal, run `pip install ckanapi` instead:

In [11]:
%pip install ckanapi


Now, you can create a new Jupyter Notebook and add the following code in separate cells:

Import the required libraries:

In [1]:
import pandas as pd
from ckanapi import RemoteCKAN

Connect to a CKAN instance (we'll use the Data.gov instance as an example):

In [2]:
ckan = RemoteCKAN('https://catalog.data.gov', user_agent='ckanapi-example/1.0')

Retrieve a list of datasets (we'll limit the results to 100 for this example):

In [3]:
result = ckan.action.package_search(rows=100)
packages = result['results']
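A single `package_search` call returns one page of results. CKAN's `package_search` action also accepts a `start` offset, so larger result sets can be fetched page by page. The generator below is a minimal sketch of that pattern. The name `iter_packages` and its parameters are hypothetical, and it takes the search function as an argument so it can be tried without a network connection.

```python
from typing import Any, Callable, Dict, Iterator


def iter_packages(
    search: Callable[..., Dict[str, Any]],
    query: str = "",
    page_size: int = 100,
    limit: int = 250,
) -> Iterator[Dict[str, Any]]:
    """Yield up to `limit` packages, requesting `page_size` rows per call."""
    start = 0
    yielded = 0
    while yielded < limit:
        rows = min(page_size, limit - yielded)
        result = search(q=query, rows=rows, start=start)
        batch = result.get("results", [])
        if not batch:
            break  # no more results on the server
        for package in batch:
            yield package
            yielded += 1
        start += len(batch)


# Stand-in for ckan.action.package_search so the sketch runs offline.
FAKE_PACKAGES = [{"name": f"dataset-{i}"} for i in range(7)]


def fake_search(q: str = "", rows: int = 10, start: int = 0) -> Dict[str, Any]:
    return {"count": len(FAKE_PACKAGES), "results": FAKE_PACKAGES[start:start + rows]}


names = [p["name"] for p in iter_packages(fake_search, page_size=3, limit=5)]
print(names)
```

With a live connection, the same generator could be called as `iter_packages(ckan.action.package_search, query="banks")`.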

Extract relevant information from the retrieved datasets and display it in a pandas DataFrame:

In [4]:
data = []

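for package in packages:
    # 'organization' can be None and 'url' is optional, so read both defensively
    organization = package.get('organization') or {}
    data.append({
        'name': package['name'],
        'title': package['title'],
        'organization': organization.get('title'),
        'url': package.get('url'),
        'last_modified': package['metadata_modified']
    })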

df = pd.DataFrame(data)
df

Unnamed: 0,name,title,organization,url,last_modified
0,fdic-failed-bank-list,FDIC Failed Bank List,Federal Deposit Insurance Corporation,,2020-11-12T12:17:38.682707
1,electric-vehicle-population-data,Electric Vehicle Population Data,State of Washington,,2023-03-18T10:13:50.819699
2,crime-data-from-2020-to-present,Crime Data from 2020 to Present,City of Los Angeles,,2023-03-25T01:47:39.336907
3,national-student-loan-data-system,National Student Loan Data System,Department of Education,,2020-11-10T16:22:35.428085
4,u-s-chronic-disease-indicators-cdi,U.S. Chronic Disease Indicators (CDI),U.S. Department of Health & Human Services,,2023-02-02T04:02:55.398724
...,...,...,...,...,...
95,retail-food-stores,Retail Food Stores,State of New York,,2023-02-24T01:40:52.620213
96,employee-payroll,Employee Payroll,Cook County of Illinois,,2022-06-30T01:53:51.085238
97,food-security-in-the-united-states,Food Security in the United States,Department of Agriculture,,2021-02-24T15:37:03.956952
98,safer-company-snapshot,SAFER - Company Snapshot,Department of Transportation,,2021-04-25T14:00:08.755889
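The `last_modified` values are ISO 8601 timestamp strings, so they can be parsed and used for sorting. The sketch below uses only the standard library and three rows copied from the output above; the `newest_first` helper is a hypothetical name. In pandas, `pd.to_datetime` on the column would serve the same purpose.

```python
from datetime import datetime
from typing import Dict, List

# Three rows copied from the table above.
rows = [
    {"name": "fdic-failed-bank-list", "last_modified": "2020-11-12T12:17:38.682707"},
    {"name": "electric-vehicle-population-data", "last_modified": "2023-03-18T10:13:50.819699"},
    {"name": "crime-data-from-2020-to-present", "last_modified": "2023-03-25T01:47:39.336907"},
]


def newest_first(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Sort records by their parsed 'last_modified' timestamp, newest first."""
    return sorted(
        records,
        key=lambda record: datetime.fromisoformat(record["last_modified"]),
        reverse=True,
    )


for record in newest_first(rows):
    print(record["name"], record["last_modified"][:10])
```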


This example demonstrates a simple use case of interacting with a CKAN instance using the ckanapi library in a Jupyter Notebook. You can extend this example by exploring other actions provided by the CKAN API, such as searching for datasets based on specific criteria, retrieving dataset metadata, or uploading new datasets. For more information on the available API actions, refer to the CKAN API documentation: https://docs.ckan.org/en/2.10/api/index.html

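By default, pandas truncates long DataFrames when displaying them. To see every row, raise the display limit and then display `df` again:

In [14]:
pd.options.display.max_rows = 1000

In [15]:
# df  # uncomment to display all rows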

Next, pick one dataset from the search results and look for a CSV file among its resources:

In [16]:
selected_package = None
csv_resource = None
csv_url = None  # stays None if no CSV resource is found

for package in packages:
    if package['name'] == 'fdic-failed-bank-list':  # Replace with the name of the dataset you want to download
        selected_package = package
        break

if selected_package:
    for resource in selected_package['resources']:
        # 'format' can be missing or empty, so fall back to an empty string
        if (resource.get('format') or '').lower() == 'csv':
            csv_resource = resource
            break
else:
    print("Dataset not found.")

if csv_resource:
    csv_url = csv_resource['url']
else:
    print("CSV resource not found.")
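The lookup above can be wrapped in a small reusable function. This is a sketch only: `find_resource_url` is a hypothetical helper name, and the example package is hand-written to mimic the shape of a CKAN package dictionary.

```python
from typing import Any, Dict, Optional


def find_resource_url(package: Dict[str, Any], wanted_format: str = "csv") -> Optional[str]:
    """Return the URL of the first resource in the given format, or None."""
    for resource in package.get("resources", []):
        # 'format' may be missing or None, so fall back to an empty string.
        if (resource.get("format") or "").lower() == wanted_format.lower():
            return resource.get("url")
    return None


# Hand-written package for illustration; the URLs are placeholders.
example_package = {
    "name": "example-dataset",
    "resources": [
        {"format": "JSON", "url": "https://example.org/data.json"},
        {"format": "CSV", "url": "https://example.org/data.csv"},
        {"format": None, "url": "https://example.org/unknown"},
    ],
}

print(find_resource_url(example_package))         # https://example.org/data.csv
print(find_resource_url(example_package, "xml"))  # None
```

Returning `None` instead of printing a message lets the caller decide how to handle a missing resource.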

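Download the CSV file and load it into a second DataFrame:

In [17]:
import requests
import io

if csv_url:
    # Fail fast on a slow server or an HTTP error instead of hanging or parsing an error page
    response = requests.get(csv_url, timeout=30)
    response.raise_for_status()

    csv_data = io.StringIO(response.text)
    df2 = pd.read_csv(csv_data)
    print(df2.head())
else:
    print("CSV URL not found.")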

                   Bank Name               City  State   Cert   \
0              Signature Bank           New York     NY  57053   
1         Silicon Valley Bank        Santa Clara     CA  24735   
2           Almena State Bank             Almena     KS  15426   
3  First City Bank of Florida  Fort Walton Beach     FL  16748   
4        The First State Bank      Barboursville     WV  14361   

                Acquiring Institution  Closing Date    Fund  
0                  Flagstar Bank, N.A.     12-Mar-23  10540  
1  FirstCitizens Bank & Trust Company     10-Mar-23  10539  
2                          Equity Bank     23-Oct-20  10538  
3            United Fidelity Bank, fsb     16-Oct-20  10537  
4                       MVB Bank, Inc.      3-Apr-20  10536  


In [19]:
df2.head()

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date,Fund
0,Signature Bank,New York,NY,57053,"Flagstar Bank, N.A.",12-Mar-23,10540
1,Silicon Valley Bank,Santa Clara,CA,24735,FirstCitizens Bank & Trust Company,10-Mar-23,10539
2,Almena State Bank,Almena,KS,15426,Equity Bank,23-Oct-20,10538
3,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20,10537
4,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20,10536
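The `Closing Date` column holds text such as `12-Mar-23`, so it needs parsing before it can be filtered or sorted by date. The sketch below works on three rows copied from the output above and uses only the standard library. The `parse_closing_date` helper is a hypothetical name, and the `%d-%b-%y` format is inferred from the sample values, assuming English month abbreviations.

```python
import csv
import io
from datetime import datetime

# Three rows copied from the output above.
SAMPLE_CSV = '''Bank Name,City,State,Cert,Acquiring Institution,Closing Date,Fund
Signature Bank,New York,NY,57053,"Flagstar Bank, N.A.",12-Mar-23,10540
Silicon Valley Bank,Santa Clara,CA,24735,FirstCitizens Bank & Trust Company,10-Mar-23,10539
Almena State Bank,Almena,KS,15426,Equity Bank,23-Oct-20,10538
'''


def parse_closing_date(text: str):
    """Parse a date such as '12-Mar-23' into a datetime.date."""
    return datetime.strptime(text.strip(), "%d-%b-%y").date()


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
closings_2023 = [
    row["Bank Name"]
    for row in rows
    if parse_closing_date(row["Closing Date"]).year == 2023
]
print(closings_2023)  # ['Signature Bank', 'Silicon Valley Bank']
```

With the full DataFrame, `pd.to_datetime(df2['Closing Date'], format='%d-%b-%y')` should give an equivalent result, provided the column name in the downloaded file matches exactly.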
