## Category Levels with No Charge Offs

When we encode *high cardinality* categorical attributes for a supervised learning task, some category levels are associated with only one value of the target attribute. In this discussion, these levels are called _pure_ levels. It does not make sense to include these levels in the final learning set, because the label for such a level is always the same value. To make this concrete: if you *always* see "paid in full" for loan records with a particular zip code (and you have a sufficient number of records with this zip code in your data), then on the basis of the training data you have, the predicted target for a loan with this zip code is "paid in full". A similar argument can be made for the other attributes. This kind of label prediction is very similar to the [1R algorithm](https://www.cs.waikato.ac.nz/~ihw/papers/95NM-GH-IHW-Develop.pdf): if you see this attribute value, you predict the label you always see with it.

If you set aside the records that belong to pure levels, the remainder is the candidate data for learning.
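
As a minimal sketch of this idea, the snippet below finds pure levels in a small toy frame and separates the records that fall in them from the rest. The toy values and the variable names (`toy`, `pure_levels`, `toy_candidates`) are hypothetical and are not taken from the SBA data.

```python
import pandas as pd

# Toy loan records (hypothetical values, not taken from the SBA data)
toy = pd.DataFrame({
    "BorrZip": ["100", "100", "100", "432", "432", "968", "968"],
    "LoanStatus": ["PIF", "PIF", "PIF", "PIF", "CHGOFF", "PIF", "CHGOFF"],
})

# A level is pure when every record in it carries the same target value
labels_per_level = toy.groupby("BorrZip")["LoanStatus"].nunique()
pure_levels = labels_per_level[labels_per_level == 1].index.tolist()

# Records in pure levels are set aside; the rest are candidates for learning
is_pure = toy["BorrZip"].isin(pure_levels)
toy_pure = toy[is_pure]
toy_candidates = toy[~is_pure]

print(pure_levels)
print(len(toy_pure), len(toy_candidates))
```

Here only the level `100` is pure, so its three records are set aside and the remaining four stay in the learning set.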

In [1]:
import pandas as pd
fptrain = "../../../data/sba_7a_loans_train.parquet"
fptest = "../../../data/sba_7a_loans_test.parquet"
df_train = pd.read_parquet(fptrain)
df_test = pd.read_parquet(fptest)
df = pd.concat([df_train, df_test])
df

Unnamed: 0,BorrName,BankFDICNumber,BankZip,BorrZip,NaicsCode,FranchiseCode,BusinessAge,LoanStatus,SBAGuaranteedApproval
0,Brothers Freight Management L,Not Applicable,87109,14580,484121.0,Not Applicable,Change of Ownership,PIF,3525000.0
1,EASY SPACE STORAGE LLC,58665,28403,65401,531130.0,Not Applicable,Change of Ownership,PIF,654750.0
2,H&W Endeavors Inc.,6560,43215,77493,449121.0,S0659,"Startup, Loan Funds will Open Business",PIF,150000.0
3,Imagine Technology Group LLC,4767,80202,85226,423420.0,Not Applicable,Existing or more than 2 years old,PIF,3052500.0
4,Zorn Fruherziehung LLC,33555,33880,2301,624410.0,Not Applicable,"Startup, Loan Funds will Open Business",PIF,187500.0
...,...,...,...,...,...,...,...,...,...
4588,IronPlane LLC,4255,4843,4101,454110.0,Not Applicable,Existing or more than 2 years old,PIF,79600.0
4589,ADORE HAIR & NAILS SALON LLC,17308,96813,96826,812112.0,Not Applicable,"Startup, Loan Funds will Open Business",CHGOFF,15000.0
4590,Sunberry Limited Manufacturing,6560,43215,48335,424990.0,Not Applicable,Existing or more than 2 years old,PIF,717750.0
4591,SNFood & Beverage LLC,5304,54220,53023,312140.0,Not Applicable,Unanswered,PIF,175000.0


In [2]:
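# Strip the trailing ".0" from NAICS codes; astype(str) makes this work for float or string input
df["NaicsCode"] = df["NaicsCode"].astype(str).str.replace(r"\.0$", "", regex=True)
# Drop the borrower name; it is not used as a feature
df = df.drop(["BorrName"], axis=1)
# Treat all categorical attributes as strings
dtypes_toset = {"BorrZip": "str", "BankZip": "str", "BankFDICNumber": "str",
                "NaicsCode": "str", "FranchiseCode": "str",
                "BusinessAge": "str", "LoanStatus": "str"}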

In [3]:
df = df.astype(dtypes_toset)

In [4]:
df_catvars = pd.DataFrame.from_dict({k: df[k].nunique() for k, v in dtypes_toset.items() if v in ['category', 'str']}, orient="index").reset_index()
df_catvars.columns = ["Attribute", "Unique_Values"]
df_catvars

Unnamed: 0,Attribute,Unique_Values
0,BorrZip,9057
1,BankZip,1169
2,BankFDICNumber,1116
3,NaicsCode,860
4,FranchiseCode,979
5,BusinessAge,6
6,LoanStatus,2


In [5]:
high_cardinality_attribs = ["BorrZip", "BankZip", "BankFDICNumber", "NaicsCode", "FranchiseCode"]
df.groupby("BorrZip").size()

BorrZip
10001    10
10002     1
10004     1
10005     3
10006     6
         ..
99709     1
99752     1
99801     1
99827     1
99835     1
Length: 9057, dtype: int64

In [6]:
df

Unnamed: 0,BankFDICNumber,BankZip,BorrZip,NaicsCode,FranchiseCode,BusinessAge,LoanStatus,SBAGuaranteedApproval
0,Not Applicable,87109,14580,484121,Not Applicable,Change of Ownership,PIF,3525000.0
1,58665,28403,65401,531130,Not Applicable,Change of Ownership,PIF,654750.0
2,6560,43215,77493,449121,S0659,"Startup, Loan Funds will Open Business",PIF,150000.0
3,4767,80202,85226,423420,Not Applicable,Existing or more than 2 years old,PIF,3052500.0
4,33555,33880,2301,624410,Not Applicable,"Startup, Loan Funds will Open Business",PIF,187500.0
...,...,...,...,...,...,...,...,...
4588,4255,4843,4101,454110,Not Applicable,Existing or more than 2 years old,PIF,79600.0
4589,17308,96813,96826,812112,Not Applicable,"Startup, Loan Funds will Open Business",CHGOFF,15000.0
4590,6560,43215,48335,424990,Not Applicable,Existing or more than 2 years old,PIF,717750.0
4591,5304,54220,53023,312140,Not Applicable,Unanswered,PIF,175000.0


## Zip Code Digit Interpretation
See the [Wikipedia article on ZIP Codes](https://en.wikipedia.org/wiki/ZIP_Code#:~:text=ZIP%20Codes%20are%20numbered%20with,delivery%20addresses%20within%20that%20region.) for the interpretation of the digits. The first digit represents a group of U.S. states, the second and third digits represent a region within that group (or a large city), and the fourth and fifth digits represent a group of delivery addresses within that region. Using the full zip code gives us attributes with very high branching (cardinality), and many of the values have too few records to generalize (and hence cause overfitting). Stopping at the third digit keeps only the regional, roughly city-level, information of the zip code. This gives us better generalization, since zip codes that differ only in the last two digits get pooled together. So rather than using the full borrower zip, we use the three-digit prefix and *get a better feature*. The same applies to the *bank zip*.

In [7]:
df["BorrZip"] = df["BorrZip"].apply(lambda x : str(x)[:3])
df["BankZip"] = df["BankZip"].apply(lambda x : str(x)[:3])
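
One caveat, offered as an observation rather than a correction of the results shown here: some zip codes in the data appear with only four digits (for example `2301` and `4101` in the table above), which suggests a dropped leading zero. Slicing such a value directly yields `230` rather than `023`. A minimal sketch of a zero-padded variant follows, assuming every value is meant to be a five-digit ZIP code. The helper name `zip3` is hypothetical.

```python
import pandas as pd

def zip3(value) -> str:
    """Return the three-digit ZIP prefix, restoring dropped leading zeros."""
    # zfill(5) assumes the value is a 5-digit ZIP that may have lost leading zeros
    return str(value).zfill(5)[:3]

zips = pd.Series(["14580", "2301", "4101"])
prefixes = zips.apply(zip3).tolist()
print(prefixes)  # ['145', '023', '041']
```

Applying this variant would change the level counts reported in the following cells, so it is shown only as an alternative.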

### Note: Reduction in Cardinality (Post Zip Code Feature Engineering)
After recoding _BankZip_ and _BorrZip_, the cardinality of these attributes drops substantially: _BorrZip_ goes from 9057 to 814 unique values and _BankZip_ from 1169 to 552. The recoded cardinalities are shown below.

In [8]:
df_catvars = pd.DataFrame.from_dict({k: df[k].nunique() for k, v in dtypes_toset.items() if v in ['category', 'str']}, orient="index").reset_index()
df_catvars.columns = ["Attribute", "Unique_Values"]
df_catvars

Unnamed: 0,Attribute,Unique_Values
0,BorrZip,814
1,BankZip,552
2,BankFDICNumber,1116
3,NaicsCode,860
4,FranchiseCode,979
5,BusinessAge,6
6,LoanStatus,2


In [9]:
dfg = df.groupby("BorrZip")["LoanStatus"].value_counts()

In [10]:
dfg

BorrZip  LoanStatus
100      PIF           84
         CHGOFF        10
102      PIF            2
103      PIF            9
         CHGOFF         2
                       ..
994      PIF            1
995      PIF           18
996      PIF            4
997      PIF            6
998      PIF            3
Name: count, Length: 1191, dtype: int64

## Outliers
We still have some high cardinality attributes with fewer than 5 instances for some of their unique values. The problems with these records are:
1. We can't really test generalization for these values, because we don't have enough records with them to split between training and test.
2. We will overfit if we try to fit data at this level of granularity; see, for example, slide 17 in [this link](https://www.mimuw.edu.pl/~son/datamining/DM/5-decision%20tree.pdf).

So we treat these groups as outliers. About 3.5K of the roughly 23K records in the dataset fall into such groups. We can analyze them as a separate group if need be, and focus on the data with good generalization for the core model development.

In [11]:
high_cardinality_attribs = ["BorrZip", "BankZip", "BankFDICNumber", "NaicsCode", "FranchiseCode"]
DROP_THRESHOLD = 5
drop_these_records = {}
for attrib in high_cardinality_attribs:
    dfg = df.groupby(attrib, observed=False).size().reset_index()
    dfg.columns = [attrib, "group_size"]
    drop_these_records[attrib] = dfg[dfg.group_size < DROP_THRESHOLD][attrib].tolist()
    

In [12]:
for attrib in high_cardinality_attribs:
    df = df[~df[attrib].isin(drop_these_records[attrib])]


In [13]:
df.shape

(19516, 8)
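
The same small-group filter can also be written as a single boolean mask with `groupby(...).transform("size")`. The sketch below uses hypothetical toy data and a smaller threshold (`MIN_GROUP_SIZE = 3`) so that the effect is visible on six rows. As in the loops above, group sizes are measured on the unfiltered frame for every attribute.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "BankZip":   ["432", "432", "432", "802", "432", "802"],
    "NaicsCode": ["713940", "713940", "713940", "713940", "722515", "722515"],
})
attribs = ["BankZip", "NaicsCode"]
MIN_GROUP_SIZE = 3  # smaller than the threshold of 5 used above, to suit the toy data

# A record is kept only if every one of its levels is large enough
keep = pd.Series(True, index=toy.index)
for attrib in attribs:
    keep &= toy.groupby(attrib)[attrib].transform("size") >= MIN_GROUP_SIZE

toy_kept = toy[keep]
print(toy_kept)
```

Only the first three rows survive: the others belong to a bank zip or a NAICS code with fewer than three records.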

## Pure Level Identification
This block identifies the category levels that have a *single value* of the target attribute and captures them in the `drop_these_records` dictionary. The `the_red_flags` dictionary captures the category levels with a 100 percent charge-off rate. As the output below shows, it is empty, while many category levels flag only good loans (paid in full).


In [14]:
drop_these_records = {}
the_red_flags = {}
for attrib in high_cardinality_attribs:
    # Count of each loan status within every level of the attribute
    dfg = df.groupby(attrib)["LoanStatus"].value_counts().reset_index()
    # Share (in percent) of each loan status within its level
    dfg["percentage"] = (100 * dfg["count"]  / dfg.groupby(attrib)['count'].transform('sum')).round(2)
    # Levels where every loan was charged off
    all_cases_chgoff = (dfg.LoanStatus == "CHGOFF") & (dfg.percentage == 100.00)
    the_red_flags[attrib] = dfg[all_cases_chgoff][attrib].tolist()
    # Levels where every loan was paid in full
    all_cases_pif = (dfg.LoanStatus == "PIF") & (dfg.percentage == 100.00)
    # Pure levels of either kind
    all_cases = all_cases_chgoff | all_cases_pif
    drop_these_records[attrib] = dfg[all_cases][attrib].tolist()

## Category Levels that are 100 percent CHGOFF

In [15]:
the_red_flags

{'BorrZip': [],
 'BankZip': [],
 'BankFDICNumber': [],
 'NaicsCode': [],
 'FranchiseCode': []}

## Category Levels that are 100 percent PIF

In [16]:
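# Note: df_pure_level is reassigned on every iteration, so after the loop it holds
# only the records in pure levels of the last attribute (FranchiseCode)
for attrib in high_cardinality_attribs:
    df_pure_level = df[df[attrib].isin(drop_these_records[attrib])]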

In [17]:
df_pure_level

Unnamed: 0,BankFDICNumber,BankZip,BorrZip,NaicsCode,FranchiseCode,BusinessAge,LoanStatus,SBAGuaranteedApproval
8,58458,802,794,442291,S0266,Change of Ownership,PIF,493500.0
17,3511,571,873,713940,S1744,"Startup, Loan Funds will Open Business",PIF,124950.0
60,6560,432,190,713940,S1744,"Startup, Loan Funds will Open Business",PIF,221760.0
81,32441,433,629,721110,S1645,Change of Ownership,PIF,1575000.0
144,32441,433,333,624120,S0810,"Startup, Loan Funds will Open Business",PIF,135000.0
...,...,...,...,...,...,...,...,...
4492,18609,793,809,713940,S0596,"Startup, Loan Funds will Open Business",PIF,213000.0
4499,57777,193,379,722515,S1533,"Startup, Loan Funds will Open Business",PIF,527400.0
4503,3890,716,782,713940,S0395,"Startup, Loan Funds will Open Business",PIF,176250.0
4521,3890,716,445,812199,S0585,"Startup, Loan Funds will Open Business",PIF,86445.0


## Drop the Pure PIF Category Levels

In [18]:
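# Note: df_to_learn is rebuilt from df on every iteration, so after the loop only
# the pure levels of the last attribute (FranchiseCode) have been removed
for attrib in high_cardinality_attribs:
    df_to_learn = df[~df[attrib].isin(drop_these_records[attrib])]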

In [19]:
df_to_learn

Unnamed: 0,BankFDICNumber,BankZip,BorrZip,NaicsCode,FranchiseCode,BusinessAge,LoanStatus,SBAGuaranteedApproval
0,Not Applicable,871,145,484121,Not Applicable,Change of Ownership,PIF,3525000.0
1,58665,284,654,531130,Not Applicable,Change of Ownership,PIF,654750.0
3,4767,802,852,423420,Not Applicable,Existing or more than 2 years old,PIF,3052500.0
4,33555,338,230,624410,Not Applicable,"Startup, Loan Funds will Open Business",PIF,187500.0
5,27476,731,730,484121,Not Applicable,Existing or more than 2 years old,PIF,37910.0
...,...,...,...,...,...,...,...,...
4588,4255,484,410,454110,Not Applicable,Existing or more than 2 years old,PIF,79600.0
4589,17308,968,968,812112,Not Applicable,"Startup, Loan Funds will Open Business",CHGOFF,15000.0
4590,6560,432,483,424990,Not Applicable,Existing or more than 2 years old,PIF,717750.0
4591,5304,542,530,312140,Not Applicable,Unanswered,PIF,175000.0
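
If the intent is to remove records that fall in a pure level of *any* high cardinality attribute (an assumption; the loop above keeps only the filter from its final iteration), the mask can be accumulated across attributes. The sketch below uses hypothetical toy data and is not the code that produced the results in this notebook.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "BankZip":       ["432", "432", "802", "802", "716"],
    "FranchiseCode": ["S1744", "S0001", "S0001", "S1744", "S0001"],
    "LoanStatus":    ["PIF", "CHGOFF", "PIF", "PIF", "CHGOFF"],
})
attribs = ["BankZip", "FranchiseCode"]

# Pure levels per attribute: levels whose records all share one target value
pure_levels = {}
for a in attribs:
    n_labels = toy.groupby(a)["LoanStatus"].nunique()
    pure_levels[a] = n_labels[n_labels == 1].index.tolist()

# Accumulate the mask so that a pure level in any attribute removes the record
in_pure_level = pd.Series(False, index=toy.index)
for a in attribs:
    in_pure_level |= toy[a].isin(pure_levels[a])

toy_to_learn = toy[~in_pure_level]
print(toy_to_learn)
```

Applying a cumulative filter like this to the real data would remove more records than the single-attribute filter, so the counts reported in the following cells would change.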


## Cardinality Post Pure Level and Outlier Removal

In [20]:
df_catvars = pd.DataFrame.from_dict({k: df_to_learn[k].nunique() for k, v in dtypes_toset.items() if v in ['str']}, orient="index").reset_index()
df_catvars.columns = ["Attribute", "Unique_Values"]
df_catvars

Unnamed: 0,Attribute,Unique_Values
0,BorrZip,673
1,BankZip,356
2,BankFDICNumber,508
3,NaicsCode,493
4,FranchiseCode,43
5,BusinessAge,6
6,LoanStatus,2


In [21]:
df_to_learn["LoanStatus"].value_counts()

LoanStatus
PIF       17453
CHGOFF      894
Name: count, dtype: int64

In [22]:
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df_to_learn, test_size=0.2)

## Weight of Evidence Encoding
Now that we have reduced the cardinality of the categorical attributes, we can apply a popular technique called _weight of evidence_ encoding to turn the categorical attributes into numeric features. _Weight of Evidence_ encoding is widely used in scorecard development for credit risk assessment, so it is a good featurization candidate for this dataset, **after** cardinality reduction. See [this article](https://ishanjainoffical.medium.com/understanding-weight-of-evidence-woe-with-python-code-cd0df0e4001e), for example, for the details of the calculation. The encoder is available in the _category encoders_ package, so the implementation is simple.
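
To make the calculation concrete, here is a minimal hand-rolled sketch of weight of evidence for one attribute: for each level, the natural log of the level's share of events divided by its share of non-events. The toy data, the function name `woe_table`, and the additive `smoothing` term are assumptions of this sketch. Some references use the reciprocal ratio, and the `category_encoders` implementation may differ in its regularization details.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); target 1 = charged off, 0 = paid in full
toy = pd.DataFrame({
    "BusinessAge": ["Startup", "Startup", "Startup", "Startup",
                    "Existing", "Existing", "Existing", "Existing"],
    "LoanStatus":  [1, 1, 0, 0, 1, 0, 0, 0],
})

def woe_table(df, col, target, smoothing=0.5):
    """Per-level weight of evidence: ln(share of events / share of non-events)."""
    stats = df.groupby(col)[target].agg(events="sum", total="count")
    stats["non_events"] = stats["total"] - stats["events"]
    # Additive smoothing (an assumption of this sketch) avoids log(0) for pure levels
    event_share = (stats["events"] + smoothing) / (stats["events"].sum() + 2 * smoothing)
    non_event_share = (stats["non_events"] + smoothing) / (stats["non_events"].sum() + 2 * smoothing)
    stats["woe"] = np.log(event_share / non_event_share)
    return stats

woe = woe_table(toy, "BusinessAge", "LoanStatus")
print(woe)
```

With this sign convention, a positive value marks a level that is over-represented among charge-offs (here `Startup`), and a negative value marks a level that is over-represented among paid-in-full loans (here `Existing`).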

In [23]:
import category_encoders as ce

cols_to_encode = high_cardinality_attribs + ["BusinessAge"]
encoder = ce.WOEEncoder(cols=cols_to_encode)

In [24]:
# Encode the target as binary: 0 = paid in full (PIF), 1 = charged off (CHGOFF)
df_train["LoanStatus"] = df_train["LoanStatus"].apply(lambda x: 0 if x == "PIF" else 1)
df_test["LoanStatus"] = df_test["LoanStatus"].apply(lambda x: 0 if x == "PIF" else 1)

In [25]:
df_train["LoanStatus"].value_counts() 

LoanStatus
0    13956
1      721
Name: count, dtype: int64

In [26]:
# Fit the encoder on the training split only, then apply the learned mapping to the test split
df_train = encoder.fit_transform(df_train, df_train["LoanStatus"])
df_test = encoder.transform(df_test)

In [27]:
fptrain = "../../../data/cleaned_sba_7a_loans_train.parquet"
fptest = "../../../data/cleaned_sba_7a_loans_test.parquet"
df_train.to_parquet(fptrain, index=False)
df_test.to_parquet(fptest, index=False)