# Enriching the Visier Model
## Overview
In this Jupyter notebook sample, learn how to use the Direct Data Intake (DDI) API in Extension mode to extend the data in a Visier object that is primarily loaded using other data transfer methods.
This sample's purpose is to show how to call Visier's public APIs to query source data, combine source data with an external data set, and then send back the combined data to Visier.

In this sample, we combine Employee data in Visier with the United States government Social Security Administration's (SSA) [publicly available name data](https://www.ssa.gov/oact/babynames/limits.html) to populate a new Employee property called Name_Rank.

### Prerequisites
Before running this sample, do the following:
1. In your Visier tenant, create a simple numeric property with the object name `Name_Rank` on the `Employee` object in Visier. Publish the property to production.
1. In your Visier tenant, create a profile and assign it to your API user with the following capabilities and additional capabilities:
   1. Data: Write, API
   1. Model: Read, API
   1. Direct Publish
   1. Legacy API Access
   1. Manage Jobs
   1. Upload Data
1. Define an `.env` file as described in the [Python Connector documentation](https://github.com/visier/connector-python#jupyter-basic-authentication-example).
1. Download the [National Data file](https://www.ssa.gov/oact/babynames/names.zip) and extract the file `yob2022.txt`.
1. Copy `yob2022.txt` into a directory named `data`.

## Load Extension Data
We use the [pandas](https://pandas.pydata.org/docs/) library both to represent data sets as [DataFrames](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) and to handle files. We load data with the [read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function because the source files are comma-separated.
The source file doesn't contain a header row, so in the following code we supply column names that exactly match the names of the Visier properties. We then sort the rows by the number of occurrences of each name and, within each gender, assign the name rank from the row position in the sorted list.
We also define an index to facilitate a subsequent join operation with Visier `Employee` data.

The SSA designates names as Female or Male. From this data, we create two distinct SSA name data sets: `f_names` and `m_names`.

Visier's Gender dimension has more than two values. We address how it maps to the two SSA categories in the next section, where the query code performs that mapping.

In [1]:
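import pandas as pd

def filter_and_index(df, gender):
    # Keep only the rows for one gender; copy so the source frame is untouched
    subset = df.loc[df['Gender'] == gender].copy()
    # Rows arrive sorted by Count (descending), so position + 1 is the rank
    subset['Name_Rank'] = range(1, len(subset) + 1)
    return subset.set_index(['First_Name', 'Gender'])

# The SSA file has no header row, so pass header=None and supply column names
names = pd.read_csv('data/yob2022.txt', header=None, names=['First_Name', 'Gender', 'Count'])
names = names.sort_values(by='Count', ascending=False)
f_names = filter_and_index(names, 'F')
m_names = filter_and_index(names, 'M')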
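
To make the ranking step concrete, here is a minimal sketch that runs the same sort, filter, and rank logic on a small made-up sample instead of the real SSA file. The names and counts are invented for illustration, and `rank_within_gender` is a hypothetical helper name, not part of the sample.

```python
import io
import pandas as pd

# Made-up rows in the same header-less, three-column layout as yob2022.txt
sample = io.StringIO("Olivia,F,300\nEmma,F,250\nLiam,M,400\nNoah,M,350\nAva,F,100\n")

names = pd.read_csv(sample, header=None, names=['First_Name', 'Gender', 'Count'])
names = names.sort_values(by='Count', ascending=False)

def rank_within_gender(df, gender):
    # Filter to one gender, then number the already-sorted rows from 1
    subset = df.loc[df['Gender'] == gender].copy()
    subset['Name_Rank'] = range(1, len(subset) + 1)
    return subset.set_index(['First_Name', 'Gender'])

f_sample = rank_within_gender(names, 'F')
m_sample = rank_within_gender(names, 'M')
print(f_sample)
print(m_sample)
```

Each gender gets its own ranking that starts at 1, so the most common female name and the most common male name in the sample both have `Name_Rank` 1.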

## Query for Employee Data
Next, we instantiate the Visier Python Connector. To do so, we read the credentials, build an `auth` object to use to sign into Visier, and then get a session object, `s`.
Because we need to execute a Visier SQL-like query to get the list of employees whose names we want to rank in this sample, we import the `QueryApiClient`.

The SSA designates gender into two categories: Female and Male. Visier's Gender dimension contains more than two genders, including Woman, Man, Non-Binary, and more. To align Visier with SSA's data categorization, we run two SQL-like queries: one where `isFemale=TRUE` and one where `isMale=TRUE`. This allows us to ascribe gender keys from the SSA data set to the data in Visier.

In [4]:
import re
from dotenv import dotenv_values
from visier.connector import VisierSession, make_auth
from visier.api import QueryApiClient

env_creds=dotenv_values()
auth = make_auth(env_values=env_creds)

def mk_df(client, query):
    """Run a list query and return the results as a Pandas DataFrame.
    Transforms the result by ensuring the First_Name column contains
    single words"""
    result = client.sqllike(query)
    df = pd.DataFrame.from_records(data=result.rows(), columns=result.header)
    df['First_Name'] = df['First_Name'].transform(lambda x: re.split(r'[ -]', x)[0])
    return df

    
with VisierSession(auth) as s:
    query_client = QueryApiClient(s)
    [f_df, m_df] = [mk_df(query_client, f"""SELECT EmployeeID,
                                                   First_Name
                                            FROM Employee
                                            WHERE {concept}=TRUE AND Visier_Time BETWEEN date('2021-01-01') AND date('2022-01-01')""") for concept in ["isFemale", "isMale"]]
    f_df['Gender'] = 'F'
    m_df['Gender'] = 'M'
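
The `First_Name` transformation in `mk_df` keeps only the text before the first space or hyphen, so that compound names can match the single-word names in the SSA file. A minimal sketch of that behavior on made-up names follows; `first_token` is a hypothetical helper name used only here.

```python
import re

def first_token(name):
    # Split on a space or a hyphen and keep the first piece
    return re.split(r'[ -]', name)[0]

examples = ['Mary Ann', 'Jean-Luc', 'Olivia']
tokens = [first_token(n) for n in examples]
print(tokens)  # ['Mary', 'Jean', 'Olivia']
```

Note that a value with a leading space would yield an empty string, so it may be worth stripping whitespace first if the source data is not clean.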

## Prepare the Data Set to Upload
We now have two distinct data sets: the Visier `Employee` data and the SSA name ranking data. We can now bring them together to add Name_Rank data to the Visier `Employee` data.
Handling the two gender designations independently, we join the Visier data with the SSA data set after ensuring the Visier data set has been suitably indexed to enable an `inner` join at the correct level of granularity.

### EventDate
After concatenating the two data sets, we add a column named `EventDate` that holds a date string in `yyyy-MM-dd` format. In this sample, we use yesterday's date. As a result, each employee whose first name was found in the SSA data set has their current state record ended, immediately followed by a new record that reflects the change.

### Filename
After updating the DataFrame, we save it to a file that we upload to Visier in the next step. In this case, where we upload the file as is, the name of the file itself doesn't matter. However, if the data set is large enough to need compression, such as through a `.zip` archive, the name of the file inside the archive matters. Because we are augmenting the data of an object that is primarily loaded through other methods, we must name the file after the target object name with the suffix `DDIExt`. In this example, the filename is `EmployeeDDIExt.csv`.

**Note**: If using the DDI API with a Primary data intake mode (that is, DDI API is the main data transfer method), the name of the file inside the compressed archive (such as a ZIP file) must match the name of the target object; for example, `Employee.csv`.
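
For the compressed case, the following is a minimal sketch of building a ZIP archive whose inner file is named `EmployeeDDIExt.csv`, using Python's standard `zipfile` module. The sample row, the temporary directory, and the archive name `upload.zip` are assumptions made for illustration; this sample itself uploads the uncompressed CSV.

```python
import os
import tempfile
import zipfile
import pandas as pd

# Made-up single-row data set with the columns this sample uploads
sample = pd.DataFrame({'EmployeeID': ['E-1'],
                       'Name_Rank': [1],
                       'EventDate': ['2023-01-01']})

# Hypothetical archive location and name, chosen only for this sketch
archive_path = os.path.join(tempfile.mkdtemp(), 'upload.zip')

with zipfile.ZipFile(archive_path, 'w', compression=zipfile.ZIP_DEFLATED) as zf:
    # The member name inside the archive is what must follow the naming rule
    zf.writestr('EmployeeDDIExt.csv', sample.to_csv(index=False))

# Verify the name of the file inside the archive
with zipfile.ZipFile(archive_path) as zf:
    member_names = zf.namelist()
print(member_names)  # ['EmployeeDDIExt.csv']
```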

In [56]:
import datetime

def do_join(visier_df, ssa_df):
    result=visier_df.set_index(['First_Name', 'Gender']).join(ssa_df, how='inner')
    return result.reset_index()[['EmployeeID', 'Name_Rank']]

# Build each gender result and combine
f_result=do_join(f_df, f_names)
m_result=do_join(m_df, m_names)
result=pd.concat([f_result, m_result])

# Data Set requires column `EventDate` with the date of the change
yesterday = datetime.datetime.now() - datetime.timedelta(days=1)
yesterday_str = yesterday.strftime('%Y-%m-%d')
result['EventDate'] = yesterday_str

# Write to file to be uploaded below
result_filename='EmployeeDDIExt.csv'
result.to_csv(result_filename, index=False)
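
Because the join is an `inner` join, employees whose first names do not appear in the SSA data are silently dropped from the upload. The following minimal sketch, on made-up data, shows that effect and one possible way to list the unmatched employees; the employee IDs and names are invented.

```python
import pandas as pd

# Made-up Visier-side rows: one name is absent from the SSA-side sample
visier_df = pd.DataFrame({'EmployeeID': ['E-1', 'E-2', 'E-3'],
                          'First_Name': ['Olivia', 'Zyx', 'Emma'],
                          'Gender': ['F', 'F', 'F']})

# Made-up SSA-side rows, indexed the same way as f_names and m_names
ssa_df = pd.DataFrame({'First_Name': ['Olivia', 'Emma'],
                       'Gender': ['F', 'F'],
                       'Name_Rank': [1, 2]}).set_index(['First_Name', 'Gender'])

# Same join as do_join: align on (First_Name, Gender), keep only matches
joined = (visier_df.set_index(['First_Name', 'Gender'])
          .join(ssa_df, how='inner')
          .reset_index()[['EmployeeID', 'Name_Rank']])

# Employees that found no match and therefore receive no Name_Rank
unmatched = visier_df.loc[~visier_df['EmployeeID'].isin(joined['EmployeeID'])]
print(joined)
print(unmatched)
```

Checking the size of `unmatched` before uploading gives a quick sense of how many employees will not receive a rank.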

## Upload the Data Set with the Direct Data Intake API
To upload the data, we use the Python connector, which we instantiate in the same way as above. The difference is that this time we create a different API client: the `DirectIntakeApiClient`.

In this sample, we assume that most Visier customers use other means of sending data to Visier, such as SFTP or data connectors, which are processed in Visier's data provisioning engine. Because of this assumption, this sample uses the Direct Data Intake API to supplement existing data in Visier and extend the `Employee` object. Because the Supplemental data intake mode is not the default mode, we must configure the data intake mode for this data upload.

### Upload Transaction
The DDI API uses a transactional process that allows callers to upload many data files before committing them for processing in Visier. In this sample, we are sending one data file to Visier. To send our data file to Visier, the DDI API requires three calls:
* `start_transaction` to begin a transaction. The response contains the transaction ID, which we must retain for the next two calls.
* `upload_file` to specify a target object and provide a data file with columns that match the properties of the target object that we want to load.
* `commit_transaction` to close the transaction and process the data files in Visier.

In [None]:
from visier.api import DirectIntakeApiClient
from visier.api.direct_intake import Configuration

with VisierSession(auth) as s:
    intake_client = DirectIntakeApiClient(s, raise_on_error=True)

    # Configure the Direct Intake to supplement data in the tenant
    # Enable loading into Employee using extension tables
    config = Configuration(is_supplemental=True,
                           extend_objects=['Employee'])
    returned_config = intake_client.set_configuration(config)
    
    # Upload the file within the context of a transaction
    transaction_id = None
    try:
        tx_response = intake_client.start_transaction().json()
        print(tx_response)
        transaction_id = tx_response['transactionId']
        intake_client.upload_file(transaction_id, 'Employee', result_filename)
        intake_client.commit_transaction(transaction_id)
        print(f'Committed {transaction_id}')
    except Exception as ex:
        # Roll back only if a transaction was actually started
        if transaction_id is not None:
            print(f'Rolling back {transaction_id}', ex)
            intake_client.rollback_transaction(transaction_id)
        else:
            print('Failed to start transaction', ex)

Once the provisioning is complete, the data is available to query.
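
Because a transaction can carry more than one file, the start, upload, commit, and rollback sequence can be wrapped in a small helper. The sketch below is one possible shape, exercised against a stand-in client so that it runs without a tenant. `FakeIntakeClient`, `FakeResponse`, and `upload_in_transaction` are hypothetical names invented for this illustration; only the four method names and the `transactionId` key come from the sample above.

```python
class FakeResponse:
    """Hypothetical stand-in for the response returned by start_transaction."""
    def __init__(self, payload):
        self._payload = payload

    def json(self):
        return self._payload


class FakeIntakeClient:
    """Hypothetical stand-in that records calls instead of contacting Visier."""
    def __init__(self, fail_on_upload=False):
        self.calls = []
        self.fail_on_upload = fail_on_upload

    def start_transaction(self):
        self.calls.append('start')
        return FakeResponse({'transactionId': 'tx-1'})

    def upload_file(self, transaction_id, target, filename):
        if self.fail_on_upload:
            raise RuntimeError('simulated upload failure')
        self.calls.append(f'upload:{target}')

    def commit_transaction(self, transaction_id):
        self.calls.append('commit')

    def rollback_transaction(self, transaction_id):
        self.calls.append('rollback')


def upload_in_transaction(client, files):
    """Upload (target, filename) pairs in one transaction; roll back on error."""
    transaction_id = None
    try:
        transaction_id = client.start_transaction().json()['transactionId']
        for target, filename in files:
            client.upload_file(transaction_id, target, filename)
        client.commit_transaction(transaction_id)
        return transaction_id
    except Exception:
        # Only roll back when a transaction was actually opened
        if transaction_id is not None:
            client.rollback_transaction(transaction_id)
        raise


ok_client = FakeIntakeClient()
upload_in_transaction(ok_client, [('Employee', 'EmployeeDDIExt.csv')])
print(ok_client.calls)
```

With a real `DirectIntakeApiClient` in place of the stand-in, the helper would issue the same sequence of calls as the cell above; re-raising after the rollback lets the caller decide how to report the failure.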