<hr>

# README

* Unzip "df_trimmed.csv.zip" before reading the file with pandas.

* Reason: the file is 5.58 GB, which is too large to push to a GitHub repository, even with GitHub Large File Storage (LFS) (limit 5.0 GB).

* Note: 'df_trimmed.csv' is listed in .gitignore so that it is not pushed by accident in the future.
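
The unzip step can also be done from Python. The following is a minimal sketch using only the standard library; the helper name `extract_csv` and the default destination directory are illustrative assumptions, not part of the original project.

```python
import zipfile
from pathlib import Path


def extract_csv(zip_path, dest_dir="."):
    """Extract every .csv member of zip_path into dest_dir and return their paths.

    Hypothetical helper: the name and defaults are assumptions for illustration.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Skip anything in the archive that is not a CSV file.
            if name.lower().endswith(".csv"):
                archive.extract(name, path=dest)
                extracted.append(dest / name)
    return extracted


# Example call (requires the archive to be present locally):
# csv_paths = extract_csv("df_trimmed.csv.zip")
```

Each returned path can then be passed to `pd.read_csv`.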

<hr>

# FEATURES (Data definition):

Field definitions are documented in the HMDA LAR data fields reference: https://ffiec.cfpb.gov/documentation/publications/loan-level-datasets/lar-data-fields

<hr>

# DATASET & LIBRARIES

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv("trimmed_dataset_fromStep1.csv")

In [None]:
df.shape

In [None]:
df.info()

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [None]:
df.head()

<hr>

# EXPLORATORY ANALYSIS (EDA)

In [None]:
df.columns

In [None]:
print(df['derived_sex'].value_counts(normalize=True))
df['derived_sex'].value_counts().plot(kind='bar')

In [None]:
counts = df['derived_sex'].value_counts()

plt.figure(figsize=(6,6))
plt.pie(
    counts.values,
    labels=counts.index,
    autopct='%1.1f%%',
    startangle=90
)

plt.title('Distribution of Derived Sex')
plt.axis('equal')
plt.show()

In [None]:
counts = df['applicant_sex'].value_counts()

plt.figure(figsize=(6,6))
plt.pie(
    counts.values,
    labels=counts.index,
    autopct='%1.1f%%',
    startangle=90
)

plt.title('Applicant Sex')
plt.axis('equal')
plt.show()

In [None]:
df['applicant_sex'].value_counts()

Values (applicant_sex):
- 1 - Male
- 2 - Female
- 3 - Information not provided by applicant in mail, internet, or telephone application
- 4 - Not applicable
- 6 - Applicant selected both male and female
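
The pie chart above is labelled with raw codes, so a lookup built from this list can make it easier to read. This is a minimal sketch: the dictionary name, the helper `label_for`, and the shortened label texts are assumptions for illustration.

```python
# Code-to-label lookup based on the "Values" list; labels are shortened here.
APPLICANT_SEX_LABELS = {
    1: "Male",
    2: "Female",
    3: "Information not provided",
    4: "Not applicable",
    6: "Both male and female",
}


def label_for(code):
    """Return a readable label for an applicant_sex code.

    Codes that are missing from the lookup, or that cannot be converted to an
    integer (for example NaN), are reported as 'Unknown (<code>)'.
    """
    try:
        return APPLICANT_SEX_LABELS[int(code)]
    except (KeyError, ValueError, TypeError):
        return f"Unknown ({code})"


# In the notebook, the pie labels could then be built as:
# labels = [label_for(c) for c in counts.index]
```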

In [None]:
df['interest_rate'].describe()

In [None]:
df['interest_rate'].dtype

In [None]:
df['interest_rate'].isna().sum()

In [None]:
df['interest_rate'][df['interest_rate'].isna()]

In [None]:
df = df.dropna(subset=['interest_rate'])

In [None]:
df['interest_rate'].isna().sum()

In [None]:
df['interest_rate'].dtype

In [None]:
df['interest_rate'] = pd.to_numeric(df['interest_rate'], errors='coerce')

In [None]:
df['interest_rate'].dtype

In [None]:
df['interest_rate'].isna().sum()

In [None]:
df['interest_rate'][df['interest_rate'].isna()]

In [None]:
df = df.dropna(subset=['interest_rate'])

In [None]:
df['interest_rate'].isna().sum()
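
The second `dropna` is needed because `pd.to_numeric(..., errors='coerce')` turns every entry it cannot parse into NaN. Before dropping those rows, it can be useful to know how many values the conversion lost. The sketch below shows one way to count them; the helper name `coerce_numeric` and the sample values (including the placeholder string "Exempt") are assumptions for illustration, not taken from the dataset.

```python
import pandas as pd


def coerce_numeric(series):
    """Convert a column to numeric and report how many values were lost.

    Returns (numeric_series, n_lost), where n_lost counts entries that were
    present in the input but became NaN after coercion.
    """
    numeric = pd.to_numeric(series, errors="coerce")
    n_lost = int((numeric.isna() & series.notna()).sum())
    return numeric, n_lost


# Hypothetical sample; "Exempt" stands in for any non-numeric placeholder.
sample = pd.Series(["3.5", "4.25", "Exempt", None])
rates, n_lost = coerce_numeric(sample)
print(n_lost)  # entries that were present but could not be parsed
```

In the notebook, the same helper could be applied to `df['interest_rate']` ahead of the final `dropna`.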

In [None]:
df['interest_rate'].describe().round(2)

In [None]:
df = df[df['interest_rate'] <= 30]  # drop outliers: keep rates of 30% or less

In [None]:
plt.figure(figsize=(6,4))
plt.boxplot(df['interest_rate'].dropna(), vert=True, patch_artist=True)
plt.title('Interest Rate Distribution')
plt.ylabel('Interest Rate (%)')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

In [None]:
plt.figure(figsize=(7, 4))
plt.hist(df['interest_rate'], bins=30, edgecolor='black', alpha=0.7)
plt.title('Distribution of Interest Rates')
plt.xlabel('Interest Rate (%)')
plt.ylabel('Frequency')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

In [None]:
df.columns

In [None]:
df['applicant_ethnicity-1'].value_counts()

In [None]:
counts = df['applicant_ethnicity-1'].value_counts()

plt.figure(figsize=(6,6))
plt.pie(
    counts.values,
    labels=counts.index,
    autopct='%1.1f%%',
    startangle=90
)

plt.title('Distribution of Applicant Ethnicity')
plt.axis('equal')
plt.show()

Values (applicant_ethnicity-1):
- 1 - Hispanic or Latino
- 11 - Mexican
- 12 - Puerto Rican
- 13 - Cuban
- 14 - Other Hispanic or Latino
- 2 - Not Hispanic or Latino
- 3 - Information not provided by applicant in mail, internet, or telephone application
- 4 - Not applicable

In [None]:
df['applicant_race-1'].value_counts()

In [None]:
df['applicant_race-1'].value_counts().plot(kind='bar')

Values (applicant_race-1):
- 1 - American Indian or Alaska Native
- 2 - Asian
- 21 - Asian Indian
- 22 - Chinese
- 23 - Filipino
- 24 - Japanese
- 25 - Korean
- 26 - Vietnamese
- 27 - Other Asian
- 3 - Black or African American
- 4 - Native Hawaiian or Other Pacific Islander
- 41 - Native Hawaiian
- 42 - Guamanian or Chamorro
- 43 - Samoan
- 44 - Other Pacific Islander
- 5 - White
- 6 - Information not provided by applicant in mail, internet, or telephone application
- 7 - Not applicable
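
In this list, as in the ethnicity list earlier, the two-digit codes read as detailed subcategories of the one-digit code they start with (for example, 21 to 27 under 2 - Asian). Assuming that reading holds, the sketch below rolls a detailed code up to its parent so that charts show fewer, broader groups. The helper name `parent_code` is hypothetical.

```python
def parent_code(code):
    """Map a two-digit subcategory code (e.g. 21) to its one-digit parent (2).

    One-digit codes are returned unchanged. Assumes that two-digit codes are
    subcategories of their leading digit, as the list layout suggests.
    """
    code = int(code)
    return code // 10 if code >= 10 else code


# In the notebook, this could be applied before counting, e.g.:
# df['applicant_race-1'].dropna().map(parent_code).value_counts()
```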