## Overview
This notebook details how the exploratory and data representation observations are created for an analytics example. In an analytics setting, the method or computational approach used to surface the insight is typically known, so the work is a matter of running a set of queries or an algorithm on the data. This is the case here. The guidelines described in [the observation glossary](https://github.com/rajivsam/KMDS/blob/main/feature_documentation/glossary_observation_types.md) are used to generate both the exploratory observations and the data representation observations.

In [1]:
import pandas as pd
fp = "../../data/sba_7a_loans.csv"
df = pd.read_csv(fp)


In [2]:
from kmds.ontology.kmds_ontology import *
from kmds.tagging.tag_types import ExploratoryTags

kaw = KnowledgeApplicationWorkflow("SBA Loan Chargeoff Analysis 2023 data", namespace=onto)

In [3]:
exp_obs_list = []
observation_count: int = 1
e1 = ExploratoryObservation(namespace=onto)
e1.finding = "Only (NaicsCode, BorrState, LoanStatus) are needed for this analysis. We don't care about other data elements for this report."
e1.finding_sequence = observation_count
e1.exploratory_observation_type = ExploratoryTags.RELEVANCE_OBSERVATION.value
exp_obs_list.append(e1)

In [4]:
observation_count += 1
e2 = ExploratoryObservation(namespace=onto)
e2.finding = "The data is government-published, so its quality has to be taken at face value :-)"
e2.finding_sequence = observation_count
e2.exploratory_observation_type = ExploratoryTags.RELEVANCE_OBSERVATION.value
exp_obs_list.append(e2)

In [5]:
kaw.has_exploratory_observations = exp_obs_list

In [6]:
subset_cols = ["NaicsCode", "BorrState", "LoanStatus"]
df = df[subset_cols]
include_loan_status = df.LoanStatus != "COMMIT"
df = df[include_loan_status]
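
The same three-column subset can also be requested when the file is read, so that columns the report never uses are not loaded at all. The following is a minimal sketch under stated assumptions: the inline CSV and its `Extra` column are made-up stand-ins for the real loan file, and only the three column names and the `COMMIT` filter come from the cells above.

```python
import io

import pandas as pd

# Hypothetical stand-in for sba_7a_loans.csv; the real file has many more columns.
sample_csv = io.StringIO(
    "NaicsCode,BorrState,LoanStatus,Extra\n"
    "111110,AK,PIF,x\n"
    "111110,AL,COMMIT,y\n"
    "111120,AK,CANCLD,z\n"
)

subset_cols = ["NaicsCode", "BorrState", "LoanStatus"]

# usecols restricts parsing to the columns the analysis needs.
sample_df = pd.read_csv(sample_csv, usecols=subset_cols)

# Drop loans whose status is "COMMIT", mirroring the filter in the cell above.
sample_df = sample_df[sample_df["LoanStatus"] != "COMMIT"]
```

Reading fewer columns can also reduce the chance of mixed-type warnings from columns that the analysis ignores anyway.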

In [7]:
from kmds.tagging.tag_types import DataRepresentationTags
dr_obs_list = []
observation_count = 1
dr1 = DataRepresentationObservation(namespace=onto)
dr1.finding = (
    "See the data dictionary in the repo data directory; the business type is captured by the Naics Code. "
    "To compute loan performance by business type, group by Naics Code, count the Loan Status in each group, "
    "and then compute the percentages for each loan status within a group. "
    "Therefore, in each group the percentages must add up to 100 percent."
)
dr1.finding_sequence = observation_count
dr1.data_representation_observation_type = DataRepresentationTags.DATA_TRANSFORMATION_OBSERVATION.value
dr_obs_list.append(dr1)

In [8]:
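# Count each LoanStatus within every NaicsCode group (pandas >= 2.0 names the counts column "count").
dfnc = df.groupby("NaicsCode")["LoanStatus"].value_counts().reset_index()
dfnc["NaicsCode"] = dfnc.NaicsCode.astype(int)
# Express each count as a percentage of its group's total.
dfnc["percentage"] = (100 * dfnc["count"] / dfnc.groupby("NaicsCode")["count"].transform("sum")).round(2)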

In [9]:
dfnc

,NaicsCode,LoanStatus,count,percentage
0,111110,CANCLD,16,44.44
1,111110,EXEMPT,14,38.89
2,111110,PIF,6,16.67
3,111120,CANCLD,1,33.33
4,111120,EXEMPT,1,33.33
...,...,...,...,...
3068,926120,CANCLD,2,100.00
3069,926130,EXEMPT,2,66.67
3070,926130,PIF,1,33.33
3071,926150,EXEMPT,1,100.00
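
The finding recorded in `dr1` states that the percentages within each group must add up to 100. That invariant can be checked directly. The sketch below runs on a small made-up frame: the counts for 111110 match the rows shown above, while the third row for 111120 and the 0.05 tolerance are assumptions. A tolerance is needed because each percentage is rounded to two decimals before summing.

```python
import pandas as pd

# Toy counts; the PIF row for 111120 is assumed for illustration.
toy = pd.DataFrame({
    "NaicsCode": [111110, 111110, 111110, 111120, 111120, 111120],
    "LoanStatus": ["CANCLD", "EXEMPT", "PIF", "CANCLD", "EXEMPT", "PIF"],
    "count": [16, 14, 6, 1, 1, 1],
})

# Same percentage computation as in the notebook cell.
group_sizes = toy.groupby("NaicsCode")["count"].transform("sum")
toy["percentage"] = (100 * toy["count"] / group_sizes).round(2)

# Sum the rounded percentages per group and measure the distance from 100.
group_totals = toy.groupby("NaicsCode")["percentage"].sum()
off_by = (group_totals - 100).abs()

# Assumed tolerance: rounding can shift each row by at most 0.005.
all_groups_ok = bool((off_by < 0.05).all())
```

Applying the same check to `dfnc` would flag any group whose rows were dropped or double-counted before the percentages were computed.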


In [10]:
dr2 = DataRepresentationObservation(namespace=onto)
dr2.finding = (
    "The borrower state is captured by the BorrState attribute; see the data dictionary. "
    "To compute loan performance by BorrState, group by BorrState, count the Loan Status in each group, "
    "and then compute the percentages for each loan status within a group. "
    "Therefore, in each group the percentages must add up to 100 percent."
)
observation_count += 1
dr2.finding_sequence = observation_count
dr2.data_representation_observation_type = DataRepresentationTags.DATA_TRANSFORMATION_OBSERVATION.value
dr_obs_list.append(dr2)

In [11]:
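# Count each LoanStatus within every BorrState group (pandas >= 2.0 names the counts column "count").
dfs = df.groupby("BorrState")["LoanStatus"].value_counts().reset_index()
# Express each count as a percentage of its state's total.
dfs["percentage"] = (100 * dfs["count"] / dfs.groupby("BorrState")["count"].transform("sum")).round(2)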
dfs

,BorrState,LoanStatus,count,percentage
0,AK,EXEMPT,302,80.53
1,AK,CANCLD,42,11.20
2,AK,PIF,31,8.27
3,AL,EXEMPT,1046,79.85
4,AL,CANCLD,133,10.15
...,...,...,...,...
205,WV,CANCLD,66,9.59
206,WY,EXEMPT,244,71.55
207,WY,PIF,65,19.06
208,WY,CANCLD,29,8.50
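
The NaicsCode and BorrState cells apply the same three steps: group, count, and convert to percentages. One way to avoid repeating them is a small helper. In the sketch below, `status_percentages` is a hypothetical name that is not part of the notebook or of KMDS, and the toy frame is made up. It uses `groupby(...).size()` rather than `value_counts()`, so rows within a group come out ordered by status name instead of by count.

```python
import pandas as pd


def status_percentages(frame: pd.DataFrame, group_col: str,
                       status_col: str = "LoanStatus") -> pd.DataFrame:
    """Return count and within-group percentage of each status per group."""
    # One row per (group, status) pair with the number of loans in it.
    counts = frame.groupby([group_col, status_col]).size().reset_index(name="count")
    # Total loans in each group, aligned row by row with counts.
    totals = counts.groupby(group_col)["count"].transform("sum")
    counts["percentage"] = (100 * counts["count"] / totals).round(2)
    return counts


# Made-up data for illustration only.
toy = pd.DataFrame({
    "BorrState": ["AK", "AK", "AK", "AL"],
    "LoanStatus": ["EXEMPT", "EXEMPT", "PIF", "CANCLD"],
})
result = status_percentages(toy, "BorrState")
```

With a helper like this, the two tables above would come from `status_percentages(df, "NaicsCode")` and `status_percentages(df, "BorrState")`.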


In [12]:
kaw.has_data_representation_observations = dr_obs_list

In [13]:
KNOWLEDGE_BASE = "sba_7a_2023_perf_kb.xml"
onto.save(file=KNOWLEDGE_BASE, format="rdfxml")