## Introduction
This notebook extracts target variables from Google Data Commons at several geographic scopes: zip code, county, state, and country. Use the code as a template and adapt it to your own use case.


In [1]:
import requests
import pandas as pd
import datacommons_pandas as dcpd
import urllib.parse  # urllib.parse.quote is used below to URL-encode page tokens

For consistency, we use the public API key provided by Google Data Commons. Storing keys in a secret manager is better practice, but since this key is public, we leave it here for reference.

In [2]:
api_key = "AIzaSyCTI4Xz-UW_G2Q2RfknhcfdAnTHq5X5XuI"

## I. Fetch Scope

You can fetch all zip codes, counties, states, and countries using the provided code snippets. The returned places will include their respective 'dcid' identifiers. Depending on your use case, you may need to reformat these identifiers.

#### FIPS Codes Overview

- **State FIPS Codes:** These are two-digit codes that uniquely identify each U.S. state and territory. 
  - Example: `01` is the FIPS code for Alabama.
  - Example: `06` is the FIPS code for California.

- **County FIPS Codes:** These are five-digit codes where:
  - The first two digits represent the state FIPS code.
  - The next three digits represent the county code within that state.
  - Example: `01001` is the FIPS code for Autauga County, Alabama (`01` for Alabama and `001` for Autauga County).
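
The five-digit structure described above can be taken apart with plain string slicing. This is a minimal sketch; the helper name `split_county_fips` is hypothetical and not part of any library. Codes are kept as strings so that leading zeros are preserved.

```python
from typing import Tuple


def split_county_fips(county_fips: str) -> Tuple[str, str]:
    """Split a five-digit county FIPS code into (state_fips, county_code)."""
    if len(county_fips) != 5 or not county_fips.isdigit():
        raise ValueError(f"expected a five-digit FIPS code, got {county_fips!r}")
    # First two digits: state; last three digits: county within that state
    return county_fips[:2], county_fips[2:]


state_fips, county_code = split_county_fips("01001")
print(state_fips, county_code)  # 01 001
```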

#### Identifier Formats for Retrieved Data

- **Zipcodes:** Returned in the format `'zip/02908'`. 
- **Counties:** Returned in the format `'geoId/01001'`.
- **States:** Returned in the format `'geoId/01'`.
- **Countries:** Returned in the format `'country/ABW'` (ISO 3166-1 alpha-3 code, e.g., `ABW` for Aruba).
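
If a downstream dataset expects bare codes rather than DCIDs, the prefix can be removed by splitting on the first `/`. The sketch below assumes every identifier follows the `prefix/code` pattern listed above; `strip_dcid_prefix` is a hypothetical helper name.

```python
def strip_dcid_prefix(dcid: str) -> str:
    """Return the part of a DCID after the first '/', e.g. 'zip/02908' -> '02908'."""
    prefix, sep, code = dcid.partition("/")
    if not sep or not code:
        raise ValueError(f"not in 'prefix/code' form: {dcid!r}")
    return code  # kept as a string so leading zeros are not lost


examples = ["zip/02908", "geoId/01001", "geoId/01", "country/ABW"]
codes = [strip_dcid_prefix(d) for d in examples]
print(codes)  # ['02908', '01001', '01', 'ABW']
```

Converting these codes to integers would drop leading zeros (`02908` would become `2908`), so string handling is usually the safer choice.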

### a. Fetching 'Places' by Scope

To retrieve all 'places' within a given scope—whether it's a county, zip code, state, or country—you can use the REST v2 API. Keep in mind that after retrieving `all_in_arcs`, a cleansing process may be necessary to remove irrelevant metadata.

This section provides the logistics of fetching 'places'. Refer to it as needed for other parts of your project.

If you prefer not to delve into the detailed process, you can directly use the `get_zipcodes`, `get_counties`, `get_states`, or `get_countries` functions to obtain a list of the required 'places'.

**Reference Table:**

| Scope        | Example Place DCID                         | GDC Class Name                  |
|--------------|--------------------------------------------|---------------------------------|
| Zip Codes    | zip/02908                                  | CensusZipCodeTabulationArea     |
| Counties     | geoId/01001                                | County                          |
| States       | geoId/01                                   | State                           |
| Countries    | country/ABW                                | Country                         |


In [3]:
def get_inarcs_for_scope_paginated(scope, api_key, page_token=None):
    """
    Fetch in-arcs for a given scope using Google Data Commons API, with optional pagination.

    Parameters:
    - scope (str): The scope for which to fetch in-arcs (e.g., zip code, county, state).
    - api_key (str): The API key for authenticating with the Google Data Commons API.
    - page_token (str, optional): Token for fetching the next page of results, if available.

    Returns:
    - arcs (list): A list of in-arcs representing the 'places' within the specified scope.
    - nextPageToken (str or None): The token for the next page of results, or None if no further pages exist.
    """
    # Base URL for the API request
    base_url = f"https://api.datacommons.org/v2/node?key={api_key}&nodes={scope}&property=<-*"
    
    # Append page token if available
    if page_token:
        encoded_token = urllib.parse.quote(page_token)  # URL-encode the token
        url = f"{base_url}&nextToken={encoded_token}"
    else:
        url = base_url
    
    # Make the API request and fail early on HTTP errors
    response = requests.get(url)
    response.raise_for_status()
    data = response.json()
    
    # Extract the 'typeOf' in-arcs for the single requested node
    arcs = data['data'][list(data['data'].keys())[0]]['arcs']['typeOf']['nodes']
    
    # 'nextToken' is absent on the last page, so .get() returns None there
    return arcs, data.get('nextToken')

def get_all_inarcs(class_name, api_key):
    """
    Fetch all in-arcs for a specified class, iterating through paginated results.

    Parameters:
    - class_name (str): The class name or scope for which to retrieve all in-arcs.
    - api_key (str): The API key for authenticating with the Google Data Commons API.

    Returns:
    - all_in_arcs (list): A comprehensive list of all in-arcs for the specified class.
    """
    all_in_arcs = []
    page_token = None
    
    # Continue fetching data until there's no nextPageToken
    while True:
        in_arcs, page_token = get_inarcs_for_scope_paginated(class_name, api_key, page_token)
        all_in_arcs.extend(in_arcs)
        # Exit the loop if there's no next page
        if not page_token:
            break
    
    return all_in_arcs
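
The token-following loop above can be exercised without network access by passing the page fetcher in as an argument. This is a sketch under stated assumptions: `collect_all_pages`, `fake_fetch`, and the `FAKE_PAGES` data are hypothetical stand-ins, and the `max_pages` guard is an addition that the original loop does not have.

```python
from typing import Callable, List, Optional, Tuple

Page = Tuple[List[dict], Optional[str]]


def collect_all_pages(fetch_page: Callable[[Optional[str]], Page],
                      max_pages: int = 1000) -> List[dict]:
    """Call fetch_page(token) repeatedly until it returns no next token."""
    items: List[dict] = []
    token: Optional[str] = None
    for _ in range(max_pages):
        page_items, token = fetch_page(token)
        items.extend(page_items)
        if not token:
            return items
    # Guard against a server that keeps returning tokens forever
    raise RuntimeError(f"gave up after {max_pages} pages")


# Made-up two-page response, keyed by the token that requests each page
FAKE_PAGES = {
    None: ([{"dcid": "geoId/01"}, {"dcid": "geoId/02"}], "page-2"),
    "page-2": ([{"dcid": "geoId/04"}], None),
}


def fake_fetch(token: Optional[str]) -> Page:
    return FAKE_PAGES[token]


all_items = collect_all_pages(fake_fetch)
print([item["dcid"] for item in all_items])  # ['geoId/01', 'geoId/02', 'geoId/04']
```

Against the real API, the fetcher could be `lambda token: get_inarcs_for_scope_paginated('County', api_key, token)`.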

In [4]:
def get_zipcodes():
    all_in_arcs = get_all_inarcs('CensusZipCodeTabulationArea', api_key)
    places = [item['dcid'] for item in all_in_arcs]
    return places 

In [5]:
def get_counties():
    all_in_arcs = get_all_inarcs('County', api_key)
    places = [item['dcid'] for item in all_in_arcs]
    return places 

In [6]:
def get_states():
    all_in_arcs = get_all_inarcs('State', api_key)
    places = [item['dcid'] for item in all_in_arcs]
    return places 

In [7]:
def get_countries():
    all_in_arcs = get_all_inarcs('Country', api_key)
    places = [item['dcid'] for item in all_in_arcs]
    return places 
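
The cleansing step mentioned earlier could look like the sketch below, which keeps only unique DCIDs with an expected prefix. The helper name `clean_place_dcids` and the sample entries are made up for illustration; the real in-arc entries may carry different extra fields.

```python
def clean_place_dcids(in_arcs, prefix):
    """Keep unique DCIDs that start with `prefix`, preserving first-seen order."""
    seen = set()
    cleaned = []
    for item in in_arcs:
        dcid = item.get("dcid")
        # Skip entries without a DCID, with another prefix, or already seen
        if dcid and dcid.startswith(prefix) and dcid not in seen:
            seen.add(dcid)
            cleaned.append(dcid)
    return cleaned


# Made-up sample: a duplicate, an entry without a DCID, and a foreign prefix
sample = [
    {"dcid": "geoId/01001", "name": "Autauga County"},
    {"dcid": "geoId/01001"},
    {"name": "no dcid here"},
    {"dcid": "country/ABW"},
]
print(clean_place_dcids(sample, "geoId/"))  # ['geoId/01001']
```

Because states and counties share the `geoId/` prefix, a prefix filter alone does not separate them; the code length (two digits versus five) would be needed for that.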

### b. Example Usages to Fetch Different Scopes

The following examples fetch each scope and display the first five results.

In [5]:
# Example Usage - Zipcode
zipcode_dcids = get_zipcodes()

In [6]:
zipcode_dcids[:5]

['zip/00601', 'zip/00602', 'zip/00603', 'zip/00604', 'zip/00606']

In [10]:
# Example Usage - County
county_dcids = get_counties()

In [11]:
county_dcids[:5]

['geoId/01001', 'geoId/01003', 'geoId/01005', 'geoId/01007', 'geoId/01009']

In [12]:
# Example Usage - State
state_dcids = get_states()

In [13]:
state_dcids[:5]

['geoId/01', 'geoId/02', 'geoId/04', 'geoId/05', 'geoId/06']

In [14]:
# Example Usage - Country 
country_dcids = get_countries()

In [15]:
country_dcids[:5]

['country/ABW', 'country/AFG', 'country/AFI', 'country/AGO', 'country/AIA']

## II. Fetch Target

In this section, we will retrieve target variables using the `datacommons_pandas` package. This approach is organized and straightforward, as we have all the necessary DCIDs for the places within our specified scope. 

The main function is `fetch_data_and_process_dpd(places, target_variable_dcid, api_key, threshold=0.8)`. 
The `threshold` parameter is a quantile: the default of 0.8 marks the top 20% of values and can be adjusted according to your needs. Depending on specific requirements, additional steps may be needed to clean and transform the DataFrame.

The resulting DataFrame will include the following columns:

- **PlaceDCID**
- **PlaceName**
- **Target** (0/1): This column contains the classified value.
- **Variable_DCID**: For instance, 'Count_Person', which contains the raw value.

The target variable is classified against a quantile cutoff that by default (`threshold=0.8`) selects the top 20%:

- **0** for values below the top 20%
- **1** for values in the top 20%

To meet different needs, you can easily create a new DataFrame with only the desired columns.
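
To make the classification rule concrete, the sketch below reproduces the arithmetic in plain Python on made-up values. It assumes the same linear interpolation that pandas uses by default for `quantile`; `linear_quantile` and the `place/...` keys are hypothetical names.

```python
import math


def linear_quantile(values, q):
    """Quantile with linear interpolation between the two nearest ranks."""
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    position = (len(ordered) - 1) * q
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# Made-up raw values for five places
raw = {"place/A": 10, "place/B": 50, "place/C": 30, "place/D": 20, "place/E": 40}

threshold_val = linear_quantile(raw.values(), 0.8)  # approximately 42
targets = {place: 1 if value >= threshold_val else 0 for place, value in raw.items()}
print(targets)
# {'place/A': 0, 'place/B': 1, 'place/C': 0, 'place/D': 0, 'place/E': 0}
```

With five places and a 0.8 quantile, exactly one place (the top 20%) receives a target of 1.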

In [7]:
def classify_target(raw_value, threshold_val):
    """Return 1 if raw_value is at or above threshold_val (top 20% by default), else 0."""
    return 1 if raw_value >= threshold_val else 0

def fetch_data_and_process_dpd(places, target_variable_dcid, api_key, threshold = 0.8):
    """
    Fetch and process data for the specified places and target variable.

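    Parameters:
    - places: List of geoIds or place identifiers.
    - target_variable_dcid: Variable DCID to fetch.
    - api_key: Google Data Commons API key (not used inside this function).
    - threshold: Quantile in [0, 1] used as the cutoff; the default 0.8 flags the top 20%.

    Returns:
    - DataFrame containing Place DCID, Place Name, Target, and Raw Value.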
    """
    df = dcpd.build_multivariate_dataframe(places, target_variable_dcid)
    df.reset_index(inplace=True)
    df.rename(columns={'place': 'PlaceDCID'}, inplace=True)

    threshold_val = df[target_variable_dcid].quantile(threshold)
    df['Target'] = df[target_variable_dcid].apply(lambda value: classify_target(value, threshold_val))

    unique_dcids = df['PlaceDCID'].unique().tolist()
    names_dict = dcpd.get_property_values(unique_dcids, 'name')
    # Use the first returned name, or None when a place has no name
    df['PlaceName'] = df['PlaceDCID'].map(lambda dcid: (names_dict.get(dcid) or [None])[0])

    return df[['PlaceDCID', 'PlaceName', 'Target', target_variable_dcid]]

### Example Usage:

The examples utilize the target variable with the DCID `Count_Person` and demonstrate its application across different geographical scopes.

In [8]:
zipcode_df = fetch_data_and_process_dpd(zipcode_dcids, 'Count_Person', api_key)

In [9]:
zipcode_df.head()

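|   | PlaceDCID | PlaceName | Target | Count_Person |
|---|-----------|-----------|--------|--------------|
| 0 | zip/00601 | 601       | 0      | 17126        |
| 1 | zip/00602 | 602       | 1      | 37895        |
| 2 | zip/00603 | 603       | 1      | 49136        |
| 3 | zip/00606 | 606       | 0      | 5751         |
| 4 | zip/00610 | 610       | 1      | 26153        |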


## III. Fetch Zip-Code Data with the REST v2 API

In [24]:
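def fetch_data_and_process(target_variable_dcid, api_key):
    """
    Fetch the latest value of a variable for every U.S. zip code tabulation area
    via the REST v2 observation endpoint and classify each as top 20% (1) or not (0).
    """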
    # Construct the URL with the target_variable_dcid directly in the string
    url = f"https://api.datacommons.org/v2/observation?key={api_key}&date=LATEST&variable.dcids={target_variable_dcid}&entity.expression=country%2FUSA%3C-containedInPlace%2B%7BtypeOf%3ACensusZipCodeTabulationArea%7D&select=date&select=entity&select=value&select=variable"
    
    # Make the API request using the simplified structure
    response = requests.post(url, headers={'Content-Type': 'application/json'}, json={"dates": ""})
    data = response.json()
    
    # Extract all GeoIDs (ZIP Codes in this case)
    geo_ids = data['byVariable'][target_variable_dcid]['byEntity'].keys()
    
    # Initialize a dictionary to store the latest data
    zip_data = {}

    # Process each GeoID to extract the latest observation data
    for geo_id in geo_ids:
        ordered_facets = data['byVariable'][target_variable_dcid]['byEntity'][geo_id]['orderedFacets']
        
        # Find the latest observation; int() assumes year-only dates such as '2020'
        latest_observation = max(
            (obs for facet in ordered_facets for obs in facet['observations']),
            key=lambda obs: int(obs['date'])
        )
        
        # Store the latest value keyed by zip code (geo_id[4:] drops the 'zip/' prefix)
        zip_data[geo_id[4:]] = latest_observation['value']
    
    # Convert the dictionary to a DataFrame
    df = pd.DataFrame(zip_data.items(), columns=['ZipCode', target_variable_dcid])
    
    # Calculate the 80th percentile threshold / Top 20 %
    threshold = df[target_variable_dcid].quantile(0.8)
    
    # Overwrite the raw values with the 0/1 target based on the threshold
    df[target_variable_dcid] = df[target_variable_dcid].apply(lambda x: 1 if x >= threshold else 0)
    
    return df


In [25]:
# Example usage
target_variable_dcid = "Count_Person"  # Replace with your DCID
df = fetch_data_and_process(target_variable_dcid, api_key)
print(df)

      ZipCode  Count_Person
0       64449             0
1       27956             0
2       79079             0
3       62702             1
4       85009             1
...       ...           ...
33966   01260             0
33967   74063             1
33968   17534             0
33969   53119             0
33970   30555             0

[33971 rows x 2 columns]
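
The nested response handling in `fetch_data_and_process` can be tested offline against a mock payload. The sketch below mirrors the keys that function reads (`byVariable`, `byEntity`, `orderedFacets`, `observations`); the helper name `latest_value_by_entity` and all values are made up. It compares dates as strings, on the assumption that they are ISO 8601 strings, so that dates such as `2020-05` do not break the `int()` conversion.

```python
def latest_value_by_entity(data, variable_dcid):
    """Map each entity DCID to the value of its most recent observation."""
    by_entity = data["byVariable"][variable_dcid]["byEntity"]
    latest = {}
    for entity, payload in by_entity.items():
        # Flatten the observations from every facet of this entity
        observations = [
            obs
            for facet in payload.get("orderedFacets", [])
            for obs in facet.get("observations", [])
        ]
        if not observations:
            continue  # nothing reported for this entity
        # ISO 8601 date strings sort chronologically as plain strings
        newest = max(observations, key=lambda obs: obs["date"])
        latest[entity] = newest["value"]
    return latest


# Made-up payload shaped like the response parsed above
mock = {
    "byVariable": {
        "Count_Person": {
            "byEntity": {
                "zip/00601": {
                    "orderedFacets": [
                        {"observations": [{"date": "2019", "value": 100},
                                          {"date": "2021", "value": 120}]},
                        {"observations": [{"date": "2020", "value": 110}]},
                    ]
                },
                "zip/00602": {"orderedFacets": []},
            }
        }
    }
}

print(latest_value_by_entity(mock, "Count_Person"))  # {'zip/00601': 120}
```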
