# Milestone 1 - EDA and Preprocessing data 

***Important note*** - This is merely a template. You are encouraged to create your own notebook from scratch.

> Make sure to include markdown-based text commenting and explaining each step you perform.

# 1 - Extraction

In [None]:
#Import the libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
import requests
from bs4 import BeautifulSoup

In [None]:
data_dir = './Dataset/'

In [None]:
pd.set_option('display.max_columns', None)

In [None]:
fintech_df = pd.read_csv(data_dir + 'fintech_data_43_52_0812.csv')

## Tidying up column names

In [None]:
def clean_column_names(df):
    df.columns = df.columns.str.lower().str.replace(' ', '_')

clean_column_names(fintech_df)

# 2 - EDA

In [None]:
fintech_df.head()


In [None]:
fintech_df.info()
fintech_df_copy = fintech_df.copy()

### Question 1
*Do customers with higher income tend to opt for shorter or longer loan terms?*

In [None]:

#  Convert 'term' to integers (remove ' months' and convert to int)
fintech_df_copy['term_int'] = fintech_df_copy['term'].str.replace(' months', '').astype(int)

# Log-transform annual income (log1p computes log(1 + x), so zero incomes do not cause log(0) issues)
fintech_df_copy['log_annual_inc'] = np.log1p(fintech_df_copy['annual_inc'])


# KDE plot for income distribution across loan terms
plt.figure(figsize=(10,6))
sns.kdeplot(data=fintech_df_copy[fintech_df_copy['term_int'] == 36], x='log_annual_inc', label='36 Months', fill=True)
sns.kdeplot(data=fintech_df_copy[fintech_df_copy['term_int'] == 60], x='log_annual_inc', label='60 Months', fill=True)
plt.title('Income Distribution for Different Loan Terms')
plt.xlabel('log(1 + Annual Income)')
plt.ylabel('Density')
plt.legend()
plt.show()


##### **Key Insights from the Plot**:
1. Both the 36-month and 60-month loan options have similar income distributions. The peak (mode) of both curves is near the same income value, suggesting that most customers, regardless of loan term, have annual incomes in a similar range.
2. The 60-month curve has a slightly higher density at the peak. Because each curve is normalized within its own group, this indicates that 60-month borrowers are somewhat more concentrated around the average annual income (around the mode).
3. Both curves taper off at higher incomes (right side of the graph). This shows that as income increases beyond a certain threshold, the preference for loan terms does not vary significantly.

**Conclusion:**
The general pattern shows a broad overlap between the two groups, so income alone does not strongly separate shorter from longer terms; 60-month borrowers are slightly more concentrated at average income levels.
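
The visual comparison above can be backed by a numeric summary. The following is a minimal sketch, assuming the `term` and `annual_inc` columns used in the notebook; `toy_df` and its values are made up for illustration and stand in for `fintech_df_copy`.

```python
import pandas as pd

# Hypothetical toy data standing in for fintech_df_copy
toy_df = pd.DataFrame({
    'term': [' 36 months', ' 36 months', ' 60 months', ' 60 months', ' 36 months'],
    'annual_inc': [40000.0, 55000.0, 60000.0, 80000.0, 45000.0],
})

# Extract the number of months from the term string
toy_df['term_int'] = toy_df['term'].str.extract(r'(\d+)', expand=False).astype(int)

# Quartiles of annual income for each loan term
income_by_term = toy_df.groupby('term_int')['annual_inc'].describe()[['25%', '50%', '75%']]
print(income_by_term)
```

Comparing the medians and quartiles per term gives a direct answer to the question that does not depend on reading overlapping density curves.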

### Question 2
*What are the most common reasons for applying for loans (Purpose), and how does the interest rate (Int Rate) differ across these purposes? Can we identify which purpose has the lowest interest rate?*

In [None]:
# View the unique values of the 'purpose' column
unique_purposes = fintech_df_copy['purpose'].unique()
print(unique_purposes)


In [None]:
# Calculate the most common loan purposes
top_purposes = fintech_df_copy['purpose'].value_counts().nlargest(10)  # Get top 10 most common purposes

# Filter the dataframe for only the most common purposes
filtered_df = fintech_df_copy[fintech_df_copy['purpose'].isin(top_purposes.index)]

# Plot interest rate distribution for different loan purposes
plt.figure(figsize=(12,8))
sns.boxplot(x='purpose', y='int_rate', data=filtered_df)
plt.xticks(rotation=45)
plt.title('Interest Rate Differences Across Loan Purposes')
plt.xlabel('Loan Purpose')
plt.ylabel('Interest Rate (%)')
plt.show()

# Calculate the average interest rate for each loan purpose
avg_interest_by_purpose = fintech_df_copy.groupby('purpose')['int_rate'].mean().reset_index()

# Find the purpose with the lowest average interest rate
lowest_int_rate_purpose = avg_interest_by_purpose.loc[avg_interest_by_purpose['int_rate'].idxmin()]

# Output the result
print("Loan purpose with the lowest interest rate:")
print(lowest_int_rate_purpose)


##### **Key Insights from the Plot**:

1. **credit_card** purpose has the **lowest median interest rate**, making it more favorable for borrowers in terms of cost.

2. **small_business** loans tend to have the **highest median interest rate**, which might reflect the higher risk associated with business loans.

3. **small_business** and **major_purchase** loans show wider variability in interest rates, indicating that interest rates for these purposes can vary greatly depending on the borrower’s profile.

**Conclusion:**
Credit card-related loans have the most favorable interest rates, while small business loans are typically more expensive.
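
The insights above refer to medians read off the boxplot, while the code reports the mean. A small sketch that tabulates both side by side may help; it assumes the `purpose` and `int_rate` columns from the notebook, and the rows in `toy_df` are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for fintech_df_copy
toy_df = pd.DataFrame({
    'purpose': ['credit_card', 'credit_card', 'small_business',
                'small_business', 'car', 'car'],
    'int_rate': [0.08, 0.10, 0.18, 0.22, 0.11, 0.13],
})

# Count, mean, and median interest rate per purpose, ordered by median
rate_summary = (
    toy_df.groupby('purpose')['int_rate']
    .agg(['count', 'mean', 'median'])
    .sort_values('median')
)
print(rate_summary)

# Purpose with the lowest median interest rate
lowest_median_purpose = rate_summary['median'].idxmin()
print(lowest_median_purpose)
```

Including the count next to each statistic makes it easier to spot purposes whose ranking rests on very few loans.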

### Question 3
*Which states have the highest proportion of risky loans (loans graded lower, such as D, E, or F), and how does this correlate with the likelihood of default or late payment (Loan Status)? Can we identify geographic areas that might require stricter lending criteria?*


In [None]:
# View the unique values of the 'loan_status' column
unique_status = fintech_df_copy['loan_status'].unique()
print(unique_status)


In [None]:
# Map the numeric grades to their corresponding letter grades
def map_grade(numeric_grade):
    if 1 <= numeric_grade <= 5:
        return 'A'
    elif 6 <= numeric_grade <= 10:
        return 'B'
    elif 11 <= numeric_grade <= 15:
        return 'C'
    elif 16 <= numeric_grade <= 20:
        return 'D'
    elif 21 <= numeric_grade <= 25:
        return 'E'
    elif 26 <= numeric_grade <= 30:
        return 'F'
    elif 31 <= numeric_grade <= 35:
        return 'G'

fintech_df_copy['letter_grade'] = fintech_df_copy['grade'].apply(map_grade)

# Filter the data to include only risky loans (grades D, E, F, or G)
risky_grades = ['D', 'E', 'F', 'G']
risky_loans_df = fintech_df_copy[fintech_df_copy['letter_grade'].isin(risky_grades)]

# Calculate the proportion of risky loans by state
risky_loans_by_state = risky_loans_df.groupby('state').size() / fintech_df_copy.groupby('state').size()
risky_loans_by_state = risky_loans_by_state.dropna()

# Plot the proportion of risky loans by state
plt.figure(figsize=(12,8))
risky_loans_by_state.sort_values(ascending=False).plot(kind='bar')
plt.title('Proportion of Risky Loans (Grades D, E, F, G) by State')
plt.xlabel('State')
plt.ylabel('Proportion of Risky Loans')
plt.show()

# Analyze correlation between loan status ("Charged Off" or "Late") and risky grades by state
risky_status_df = risky_loans_df[risky_loans_df['loan_status'].isin(['Charged Off', 'Late (31-120 days)', 'Late (16-30 days)'])]

# Calculate proportion of risky loans that are "Charged Off" or "Late" by state
risky_status_by_state = risky_status_df.groupby('state').size() / risky_loans_df.groupby('state').size()
risky_status_by_state = risky_status_by_state.dropna()

# Plot the proportion of risky loans with default/late payment by state
plt.figure(figsize=(12,8))
risky_status_by_state.sort_values(ascending=False).plot(kind='bar', color='red')
plt.title('Proportion of Risky Loans (Grades D, E, F, G) with Default or Late Payment by State')
plt.xlabel('State')
plt.ylabel('Proportion of Default/Late Payments')
plt.show()


##### **Key Insights from the Plots:**

1. **Proportion of Risky Loans (First Plot)**:
   - **West Virginia (WV)** and **Washington D.C. (DC)** have the **highest proportion** of risky loans (grades D, E, F, G).
   - **North Dakota (ND)** has the **lowest proportion** of risky loans.

2. **Risk of Default or Late Payment (Second Plot)**:
   - **South Dakota (SD)** and **Kansas (KS)** have the **highest proportion** of risky loans that result in **default or late payment**.
   - States like **New Hampshire (NH)** and **Montana (MT)** have the **lowest default/late payment rates** for risky loans.

**Conclusion:**
- States like **WV** and **DC** have a high concentration of risky loans, but **SD** and **KS** show the highest likelihood of default or late payment. These states may benefit from **stricter lending criteria** to mitigate risk.
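
The two proportions discussed above can also be computed from boolean flags, which keeps the group size next to each share so that small-sample states are easy to identify. This is a sketch under the assumption that `state`, `letter_grade`, and `loan_status` exist as in the notebook; `toy_df` and its rows are hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for fintech_df_copy
toy_df = pd.DataFrame({
    'state': ['CA', 'CA', 'CA', 'CA', 'WV', 'WV'],
    'letter_grade': ['A', 'D', 'B', 'E', 'F', 'A'],
    'loan_status': ['Current', 'Charged Off', 'Fully Paid',
                    'Current', 'Late (31-120 days)', 'Current'],
})

risky_grades = ['D', 'E', 'F', 'G']
bad_statuses = ['Charged Off', 'Late (31-120 days)', 'Late (16-30 days)']

# Boolean flags: the mean of a boolean column is a proportion
toy_df['is_risky'] = toy_df['letter_grade'].isin(risky_grades)
toy_df['is_bad'] = toy_df['loan_status'].isin(bad_statuses)

# Number of loans and share of risky loans per state
state_summary = toy_df.groupby('state').agg(
    n_loans=('is_risky', 'size'),
    risky_share=('is_risky', 'mean'),
)

# Share of risky loans that are charged off or late, per state
state_summary['bad_share_among_risky'] = (
    toy_df[toy_df['is_risky']].groupby('state')['is_bad'].mean()
)
print(state_summary)
```

A state with only a handful of loans can reach an extreme proportion by chance, so `n_loans` is worth checking before recommending stricter criteria for it.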

### Question 4
*Are customers in payment plans more likely to have loans that are "Current" or "Late"?*

In [None]:
# Group the data by payment plan status and loan status
payment_plan_status = fintech_df_copy.groupby(['pymnt_plan', 'loan_status']).size().unstack()

# Normalize the data by the total number of customers in each group (convert to proportions)
payment_plan_status = payment_plan_status.div(payment_plan_status.sum(axis=1), axis=0)

# Plot the proportion of loan statuses for customers in payment plans vs not
payment_plan_status[['Current', 'Late (31-120 days)', 'Late (16-30 days)']].plot(kind='bar', stacked=True, figsize=(10, 6), color=['green', 'red', 'orange'])
plt.title('Proportion of Loan Status ("Current" vs "Late") by Payment Plan Status')
plt.xlabel('Payment Plan Status')
plt.ylabel('Proportion of Loans')
plt.xticks(rotation=0)
plt.legend(title='Loan Status')
plt.show()


##### **Key Insights from the Graph:**

1. **Customers Not in Payment Plans**:
   - A **majority of loans** are **current**.
   - Only a **small proportion** of loans are **late**.

2. **Customers in Payment Plans**:
   - A **significant portion of loans** are either **late 31-120 days** or **late 16-30 days**.
   - **No loans** are **current** for customers in payment plans.

**Conclusion:** Customers in payment plans are **much more likely to have loans that are late**, whereas customers not in payment plans are **more likely to have current loans**. This suggests that being in a payment plan correlates with a higher likelihood of delayed payments.
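
The same row-normalized table can be produced in one step with `pd.crosstab`. The sketch below assumes the `pymnt_plan` and `loan_status` columns; the toy values, including the `'yes'`/`'no'` encoding of `pymnt_plan`, are hypothetical and may differ from the real dataset.

```python
import pandas as pd

# Hypothetical toy data; the real pymnt_plan encoding may differ
toy_df = pd.DataFrame({
    'pymnt_plan': ['no', 'no', 'no', 'no', 'yes', 'yes'],
    'loan_status': ['Current', 'Current', 'Fully Paid', 'Late (16-30 days)',
                    'Late (31-120 days)', 'Late (16-30 days)'],
})

# Each row sums to 1: the share of every loan status within a payment-plan group
status_share = pd.crosstab(toy_df['pymnt_plan'], toy_df['loan_status'],
                           normalize='index')
print(status_share)

# Group sizes, to judge how much weight each row of proportions deserves
group_sizes = toy_df['pymnt_plan'].value_counts()
print(group_sizes)
```

If the payment-plan group is very small, its proportions should be read with caution.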

### Question 5
*Are individual borrowers more likely to experience late payments as interest rates increase compared to joint borrowers?*

In [None]:
type_unique = fintech_df_copy['type'].unique()
print(type_unique)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Normalize the 'type' field to categorize borrowers
fintech_df_copy['borrower_type'] = fintech_df_copy['type'].replace({
    'Individual': 'Individual',
    'INDIVIDUAL': 'Individual',
    'Joint App': 'Joint',
    'JOINT': 'Joint',
    'DIRECT_PAY': 'Direct Pay'  # If you want to handle Direct Pay separately
})

# Filter for relevant loan statuses (Charged Off, Late)
loan_status_filtered = fintech_df_copy[fintech_df_copy['loan_status'].isin(['Charged Off', 'Late (31-120 days)', 'Late (16-30 days)'])]

# Separate individual and joint borrowers (build the masks from the filtered frame so the indexes align)
individual_borrowers = loan_status_filtered[loan_status_filtered['borrower_type'] == 'Individual']
joint_borrowers = loan_status_filtered[loan_status_filtered['borrower_type'] == 'Joint']

# Plot interest rates vs loan status for individual borrowers
plt.figure(figsize=(12,6))
sns.boxplot(x='loan_status', y='int_rate', data=individual_borrowers)
plt.title('Interest Rate vs Loan Status (Individual Borrowers)')
plt.xlabel('Loan Status')
plt.ylabel('Interest Rate')
plt.show()

# Plot interest rates vs loan status for joint borrowers
plt.figure(figsize=(12,6))
sns.boxplot(x='loan_status', y='int_rate', data=joint_borrowers)
plt.title('Interest Rate vs Loan Status (Joint Borrowers)')
plt.xlabel('Loan Status')
plt.ylabel('Interest Rate')
plt.show()

# Compare the interest rate distributions of late/charged-off loans for individual vs joint borrowers
plt.figure(figsize=(10,6))
sns.histplot(individual_borrowers['int_rate'], label='Individual Borrowers', color='blue', kde=True)
sns.histplot(joint_borrowers['int_rate'], label='Joint Borrowers', color='orange', kde=True)
plt.title('Interest Rate Distribution for Late Loans (Individual vs Joint Borrowers)')
plt.xlabel('Interest Rate')
plt.ylabel('Count')
plt.legend()
plt.show()


##### **Key Insights:**

1. **Individual Borrowers**: The **interest rates** for **"Charged Off"** loans tend to be slightly higher compared to late payments (16-30 days and 31-120 days).

2. **Joint Borrowers**: The overall interest rates for joint borrowers across statuses (Late or Charged Off) tend to cluster around higher values than individual borrowers.

3. **Comparison**:
   - The histogram shows that **individual borrowers** tend to be late across a **wider range of interest rates**, especially in the mid-range (0.12 - 0.20).
   - **Joint borrowers**, by contrast, seem to have a **tighter distribution** with higher interest rates but fewer overall late payments, suggesting **joint borrowers may be less prone to late payments** at the same interest rates compared to individuals.

**Conclusion:**
- **Individual borrowers** are more likely to experience late payments over a wider range of interest rates.
- **Joint borrowers**, although charged slightly higher interest rates, tend to pay late less frequently than individual borrowers, especially in the mid-interest range.

### Question 6
*Does the length of employment (Emp Length) correlate with loan default risk (Loan Status)? Do borrowers with longer employment histories receive better loan grades?*

In [None]:
emp_length_unique = fintech_df_copy['emp_length'].unique()
print(emp_length_unique)

# Clean 'emp_length' to ensure it's numeric
emp_length_mapping = {
    '10+ years': 10,
    '< 1 year': 0.5,
    '1 year': 1,
    '2 years': 2,
    '3 years': 3,
    '4 years': 4,
    '5 years': 5,
    '6 years': 6,
    '7 years': 7,
    '8 years': 8,
    '9 years': 9,
    'n/a': None
}
fintech_df_copy['emp_length_clean'] = fintech_df_copy['emp_length'].replace(emp_length_mapping)

# Part 1: Analyze Loan Default Risk by Employment Length

# Print unique loan status values to see the exact names
loan_status_unique = fintech_df_copy['loan_status'].unique()
print(loan_status_unique)

# Filter for relevant loan statuses (Charged Off, Late, Fully Paid)
loan_status_filtered = fintech_df_copy[fintech_df_copy['loan_status'].isin(['Fully Paid', 'Charged Off', 'Late (31-120 days)', 'Late (16-30 days)'])]

# Plot the distribution of loan statuses across employment length
plt.figure(figsize=(12,6))
sns.boxplot(x='loan_status', y='emp_length_clean', data=loan_status_filtered)
plt.title('Employment Length vs Loan Status')
plt.xlabel('Loan Status')
plt.ylabel('Employment Length (Years)')
plt.show()

# Part 2: Analyze Loan Grades by Employment Length

plt.figure(figsize=(12,6))
sns.boxplot(x='letter_grade', y='emp_length_clean', data=fintech_df_copy)
plt.title('Employment Length vs Loan Grade')
plt.xlabel('Loan Grade')
plt.ylabel('Employment Length (Years)')
plt.show()


##### **Key Insights**

**Insights from Employment Length vs Loan Status:**
Borrowers with **Late (16-30 days)** tend to have slightly **shorter employment lengths**, but overall, there is no drastic difference between late payments and the other categories.

**Insights from Employment Length vs Loan Grade:**
   - Employment length appears to be fairly consistent across loan grades (A to G).
   - Borrowers with longer employment do not necessarily receive better grades, as the **median employment length** is similar across all grades.

**Conclusion:** Longer employment does not significantly reduce the risk of loan defaults or late payments. Employment length does not appear to heavily influence the loan grade assigned to a borrower.

# 3 - Cleaning Data

The column names are cleaned at the beginning to facilitate easier exploratory data analysis (EDA) on the dataset. Next, the `customer_id` column, which is unique for each customer, will be set as the index.

In [None]:
def set_df_index(df, index_col):
    # set_index(..., inplace=True) modifies df and returns None, so return df itself
    df.set_index(index_col, inplace=True)
    return df

set_df_index(fintech_df, 'customer_id')
fintech_df.head()


## Observe inconsistent data

In [None]:
def print_unique_values(fintech_df):
    # Loop through every column that is not float64 (categorical, boolean, and integer columns) and print its unique values
    for column in fintech_df.select_dtypes(exclude=['float64']).columns:
        unique_values = fintech_df[column].unique()
        print(f"Unique values in '{column}':")
        print(unique_values)
        print("\n")

print_unique_values(fintech_df)

##### Observation of Inconsistent Data:
`type` Field (Inconsistent Capitalization):
There are duplicate representations of the same value, such as 'Individual' and 'INDIVIDUAL', and similarly, 'Joint App' and 'JOINT'.
Action: Normalize the values to ensure consistent representation.

`emp_length` Field (Inconsistent Representations):
The values '10+ years', '2 years', and '< 1 year' represent employment length in different formats.
Action: Standardize the employment length field by converting these into numeric values (e.g., '10+ years' → 10, '< 1 year' → 0.5).

`home_ownership` Field:
There is an unusual value 'ANY', which could be considered irrelevant or a data entry error, as it doesn't seem to align with traditional categories like 'OWN', 'RENT', and 'MORTGAGE'.
Action: Investigate further to determine if 'ANY' is valid or should be removed/recategorized.
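
One way to normalize values such as those in `type` is to lower-case them before mapping, so that a single key covers every capitalization variant and unmapped values are easy to detect. This is a sketch only; the `canonical` dictionary is an assumption based on the values listed above.

```python
import pandas as pd

# The raw values observed in the `type` column
type_values = pd.Series(['Individual', 'INDIVIDUAL', 'Joint App', 'JOINT', 'DIRECT_PAY'])

# Hypothetical canonical labels, keyed by the lower-cased raw value
canonical = {
    'individual': 'Individual',
    'joint app': 'Joint',
    'joint': 'Joint',
    'direct_pay': 'Direct Pay',
}

# Strip whitespace, lower-case, then map to the canonical label
normalized = type_values.str.strip().str.lower().map(canonical)

# Any value missing from the dictionary becomes NaN and shows up here
unmapped = type_values[normalized.isna()]
print(normalized.unique())
print(unmapped)
```

Unlike `replace`, `map` turns unknown values into NaN, which makes a new spelling in future data visible instead of passing through silently.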


In [None]:
def normalize_type_field(df):
    df['type'] = df['type'].replace({
        'Individual': 'Individual',
        'INDIVIDUAL': 'Individual',
        'Joint App': 'Joint',
        'JOINT': 'Joint',
        'DIRECT_PAY': 'Direct Pay'
    })

normalize_type_field(fintech_df)

In [None]:
def clean_emp_length(df):
    emp_length_mapping = {
        '10+ years': 10,
        '< 1 year': 0.5,
        '1 year': 1,
        '2 years': 2,
        '3 years': 3,
        '4 years': 4,
        '5 years': 5,
        '6 years': 6,
        '7 years': 7,
        '8 years': 8,
        '9 years': 9,
        'n/a': None
    }
    df['emp_length'] = df['emp_length'].replace(emp_length_mapping)

clean_emp_length(fintech_df)


In [None]:
fintech_df[fintech_df['home_ownership'] == 'ANY'].head()


After reviewing the data, 'ANY' seems to be a valid value in the `home_ownership` field, so we leave it as is without replacing or dropping it.

## Findings and conclusions

This process involves normalizing, merging, and standardizing values for consistency, which will improve the quality and reliability of any analysis performed on the dataset.

Create the Lookup Dataframe to use later

In [None]:
def create_lookup_df():
    lookup_df = pd.DataFrame(columns=['column', 'original', 'imputed'])
    return lookup_df

lookup_df = create_lookup_df()

In [None]:
def add_lookup_values(lookup_df, column_name, original_column, encoded_column):
    unique_values = original_column.unique()
    unique_encoded_values = encoded_column.unique()
    new_rows = pd.DataFrame({
        'column': column_name,
        'original': unique_values,
        'imputed': unique_encoded_values,
    })
    lookup_df = pd.concat([lookup_df, new_rows], ignore_index=True)

    return lookup_df

## Observing Missing Data

In [None]:
perc_null = fintech_df.isnull().mean() * 100
print("Percentage of Missing Values in Each Column:")
perc_null

The only columns with missing values are `annual_inc_joint`, `emp_title`, `emp_length`, `int_rate`, and `description`.

In [None]:
print(fintech_df[fintech_df['annual_inc_joint'].isnull()]['type'].unique())

The `annual_inc_joint` is missing when the type is either 'Individual' or 'Direct Pay'. This makes sense because joint income is only relevant for joint borrowers, and thus this field should not have values when the borrower type is not 'Joint'. 
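
The relationship between borrower type and missing joint income can be checked as a per-type missing rate. The sketch below assumes the `type` and `annual_inc_joint` columns after normalization; the rows of `toy_df` are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for fintech_df
toy_df = pd.DataFrame({
    'type': ['Individual', 'Individual', 'Joint', 'Direct Pay', 'Joint'],
    'annual_inc_joint': [np.nan, np.nan, 120000.0, np.nan, 95000.0],
})

# Fraction of rows with a missing joint income, for each borrower type
missing_by_type = toy_df['annual_inc_joint'].isna().groupby(toy_df['type']).mean()
print(missing_by_type)
```

A rate of 1.0 for non-joint types and 0.0 for joint borrowers would confirm that the missingness is structural rather than a data-quality problem.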

----

Now we move on to observe if there is a pattern in the missingness of `emp_title`.

In [None]:
fintech_df_copy = fintech_df.copy()

# Step 1: Create bins for annual income to group the salary into ranges
salary_bins = [0, 25000, 50000, 75000, 100000, 150000, 200000, 300000, np.inf]  # open-ended top bin keeps the edges strictly increasing
salary_labels = ['<25k', '25k-50k', '50k-75k', '75k-100k', '100k-150k', '150k-200k', '200k-300k', '>300k']
fintech_df_copy['income_range'] = pd.cut(fintech_df['annual_inc'], bins=salary_bins, labels=salary_labels)

# Step 2: Create a new column to indicate whether emp_title is missing
fintech_df_copy['emp_title_missing'] = fintech_df_copy['emp_title'].isna()

# Step 3: Plot the proportion of missing emp_title in each salary range
plt.figure(figsize=(10,6))
sns.barplot(x='income_range', y='emp_title_missing', data=fintech_df_copy, estimator=lambda x: sum(x) / len(x))
plt.title('Proportion of Missing Employee Title by Salary Range')
plt.xlabel('Annual Income Range')
plt.ylabel('Proportion of Missing Employee Title')
plt.xticks(rotation=45)
plt.show()


The missingness of `emp_title` is likely **MAR**, as it is related to the salary of the borrowers. Based on the plot, the missing values for the `emp_title` column seem to be concentrated in the lower salary ranges, particularly in the <25k range. The logical explanation for this pattern could be due to individuals in lower salary ranges being employed in unstable or informal jobs where titles are less defined (e.g., part-time jobs, contract work, or temporary positions). As a result, they may be less likely to report a formal job title because they either do not have one or the job title does not seem relevant or prestigious enough to report.

-----

Now we move on to observe if there is a pattern in the missingness of `emp_length`.

In [None]:
# Step 1: Investigate relationship with salary (annual_inc)

# Create a column to indicate if emp_length is missing
fintech_df_copy['emp_length_missing'] = fintech_df_copy['emp_length'].isna()

# Visualize the proportion of missing emp_length by income range
plt.figure(figsize=(10,6))
sns.barplot(x='income_range', y='emp_length_missing', data=fintech_df_copy, estimator=lambda x: sum(x) / len(x))
plt.title('Proportion of Missing Employment Length by Salary Range')
plt.xlabel('Annual Income Range')
plt.ylabel('Proportion of Missing Employment Length')
plt.xticks(rotation=45)
plt.show()

# Investigate relationship with employment title (emp_title)
plt.figure(figsize=(10,6))
sns.barplot(x='emp_title_missing', y='emp_length_missing', data=fintech_df_copy)
plt.title('Missing Employment Length vs Missing Employment Title')
plt.xlabel('Is Employment Title Missing?')
plt.ylabel('Proportion of Missing Employment Length')
plt.show()

The missingness of `emp_length` seems to follow a **MAR** pattern based on the visualizations. 
From the first plot, we can see that a higher proportion of missing `emp_length` occurs in the lower salary ranges, particularly in the <25k range. People in low-paying jobs might have unstable or irregular employment histories, making them less likely to report their employment length. These borrowers may have short-term jobs, part-time roles, or be employed in positions where disclosing employment tenure is less common or relevant. Borrowers with lower incomes may also avoid providing detailed employment information because they perceive it as not important or because they are self-employed without a formal job duration.

The second plot shows that when the `emp_title` is missing, the `emp_length` is always missing as well.
This is likely because both fields are closely related: if a borrower does not provide their job title (perhaps due to informal or unstable work), it makes sense that they would also omit their employment length. People who leave their job title blank may not want to disclose their job history either.

In conclusion, the missingness of `emp_length` is **MAR** because it appears to be dependent on observed variables, like low income and missing job title. The logical reasoning is that borrowers in lower salary ranges, or those with informal/unstable jobs, are less likely to provide employment information (both title and length).
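
The pattern read from the bar plots can also be tabulated as a missing rate per income range, together with the number of borrowers in each range. This is a minimal sketch with made-up data; the bin edges are coarser than the ones used in the notebook and are only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for fintech_df_copy
toy_df = pd.DataFrame({
    'annual_inc': [12000, 20000, 24000, 40000, 60000, 90000, 130000, 400000],
    'emp_length': [np.nan, np.nan, 1.0, 3.0, np.nan, 10.0, 7.0, 10.0],
})

# Illustrative income bins with an open-ended top bin
bins = [0, 25000, 50000, 100000, np.inf]
labels = ['<25k', '25k-50k', '50k-100k', '>100k']
toy_df['income_range'] = pd.cut(toy_df['annual_inc'], bins=bins, labels=labels)

# Number of borrowers and share of missing emp_length per income range
missing_summary = (
    toy_df.assign(emp_length_missing=toy_df['emp_length'].isna())
    .groupby('income_range', observed=False)['emp_length_missing']
    .agg(['count', 'mean'])
)
print(missing_summary)
```

A missing rate that falls steadily as income rises is consistent with MAR; a flat rate across ranges would point toward MCAR instead.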

---

Now we move on to observe if there is a pattern in the missingness of `int_rate`.

In [None]:
# Check for missing values in 'int_rate'
missing_int_rate = fintech_df['int_rate'].isna()

fintech_df_copy['missing_int_rate'] = missing_int_rate

# Step 1: Visualize missing int_rate by loan term
plt.figure(figsize=(10,6))
sns.barplot(x='term', y='missing_int_rate', data=fintech_df_copy, estimator=lambda x: sum(x) / len(x))
plt.title('Proportion of Missing Interest Rate by Loan Term')
plt.xlabel('Loan Term (Months)')
plt.ylabel('Proportion of Missing Interest Rate')
plt.xticks(rotation=45)
plt.show()

# Step 2: Visualize missing int_rate by loan status
plt.figure(figsize=(10,6))
sns.barplot(x='loan_status', y='missing_int_rate', data=fintech_df_copy, estimator=lambda x: sum(x) / len(x))
plt.title('Proportion of Missing Interest Rate by Loan Status')
plt.xlabel('Loan Status')
plt.ylabel('Proportion of Missing Interest Rate')
plt.xticks(rotation=45)
plt.show()

fintech_df_copy['letter_grade'] = fintech_df_copy['grade'].apply(map_grade)

# Step 3: Visualize missing int_rate by loan grade
plt.figure(figsize=(10,6))
sns.barplot(x='letter_grade', y='missing_int_rate', data=fintech_df_copy, estimator=lambda x: sum(x) / len(x))
plt.title('Proportion of Missing Interest Rate by Loan Grade')
plt.xlabel('Loan Grade')
plt.ylabel('Proportion of Missing Interest Rate')
plt.xticks(rotation=45)
plt.show()

# Step 4: Visualize missing int_rate by funded amount range
# Create bins for funded amount
# Checking the maximum value of funded_amount
max_funded = fintech_df_copy['funded_amount'].max()
# Set bins based on the maximum value of funded_amount
if max_funded <= 50000:
    funded_bins = [0, 5000, 10000, 20000, max_funded]
    # funded_amount is in dollars, so divide by 1000 for the 'k' label
    funded_labels = ['<5k', '5k-10k', '10k-20k', f'20k-{int(max_funded / 1000)}k']
else:
    funded_bins = [0, 5000, 10000, 20000, 50000, max_funded]
    funded_labels = ['<5k', '5k-10k', '10k-20k', '20k-50k', '>50k']

# Now apply the binning safely
fintech_df_copy['funded_range'] = pd.cut(fintech_df_copy['funded_amount'], bins=funded_bins, labels=funded_labels, include_lowest=True)

plt.figure(figsize=(10,6))
sns.barplot(x='funded_range', y='missing_int_rate', data=fintech_df_copy, estimator=lambda x: sum(x) / len(x))
plt.title('Proportion of Missing Interest Rate by Funded Amount Range')
plt.xlabel('Funded Amount Range')
plt.ylabel('Proportion of Missing Interest Rate')
plt.xticks(rotation=45)
plt.show()


The missing `int_rate` values are likely **MCAR** because no clear pattern or correlation was observed between the missingness and other variables in the dataset (such as loan term, loan grade, or funded amount). This suggests that the missing interest rates are unrelated to both observed data and the actual interest rate values. They were likely missed due to random chance, possibly during the data entry process, such as when filling out the form.

---

Now we move on to observe if there is a pattern in the missingness of `description`.

The missing values in the `description` field are likely **MCAR**. Since this field is **optional**, borrowers can choose whether or not to provide a description, and the missingness is not systematically related to other variables. It likely occurs due to borrower preference, without any underlying pattern or bias in the data.

## Handling Missing data

In [None]:
# Generic univariate imputation function
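def univariate_imputation(df, column, fill_value):
    # Assign the result back instead of calling fillna(inplace=True) on a column
    # selection, which can act on a temporary copy in recent pandas versions
    df[column] = df[column].fillna(fill_value)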

# Generic multivariate imputation function
def multivariate_imputation(df, column_to_impute, group_by_column, method='mode'):
    if method == 'mode':
        df[column_to_impute] = df.groupby(group_by_column)[column_to_impute].transform(
            lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'Unknown')
        )
    elif method == 'median':
        global_median = df[column_to_impute].median()  # Calculate global median
        df[column_to_impute] = df.groupby(group_by_column)[column_to_impute].transform(
            lambda x: x.fillna(x.median() if not x.dropna().empty else global_median)
        )

def null_values_sum(df,column):
    return df[column].isnull().sum()


`annual_inc_joint`: Since the missing values are logically tied to borrower type, we do not need to impute these values because it makes sense for them to be missing.

We can simply replace the null values with 0 to avoid any null values disrupting the calculations.

In [None]:
def update_lookup_df(lookup_df, column_name, original_value, imputed_value):
    lookup_df = pd.concat([lookup_df, pd.DataFrame([{'column': column_name, 'original': original_value, 'imputed': imputed_value}])], ignore_index=True)
    return lookup_df

In [None]:
univariate_imputation(fintech_df, 'annual_inc_joint', 0)
lookup_df = update_lookup_df(lookup_df, 'annual_inc_joint', 'missing', 0)

In [None]:
null_values_sum(fintech_df,'annual_inc_joint')

`emp_title` and `emp_length`: Both are MAR with respect to salary range, so we impute `emp_title` with the mode and `emp_length` with the median within each salary range.
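
To illustrate what group-wise imputation does, the sketch below fills each missing `emp_length` with the median of its own income range. The toy frame and its `income_range` column are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per income range
toy_df = pd.DataFrame({
    'income_range': ['<25k', '<25k', '<25k', '25k-50k', '25k-50k', '25k-50k'],
    'emp_length': [1.0, 3.0, np.nan, 8.0, np.nan, 10.0],
})

# Median of each group, broadcast back to the original row order
group_median = toy_df.groupby('income_range')['emp_length'].transform('median')

# Fill only the missing entries with their group's median
toy_df['emp_length_imputed'] = toy_df['emp_length'].fillna(group_median)
print(toy_df)
```

Observed values are left untouched, and each gap receives a value typical of borrowers in the same salary range rather than of the whole dataset.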


In [None]:
def impute_emp_fields(df):
    # Define salary ranges without creating a new column (the open-ended top bin keeps the edges strictly increasing)
    salary_bins = [0, 25000, 50000, 75000, 100000, 150000, 200000, 300000, np.inf]
    salary_labels = ['<25k', '25k-50k', '50k-75k', '75k-100k', '100k-150k', '150k-200k', '200k-300k', '>300k']
    salary_groups = pd.cut(df['annual_inc'], bins=salary_bins, labels=salary_labels)

    # Impute 'emp_title' using mode within salary ranges
    multivariate_imputation(df, 'emp_title', salary_groups, method='mode')

    # Impute 'emp_length' using median within salary ranges
    multivariate_imputation(df, 'emp_length', salary_groups, method='median')

impute_emp_fields(fintech_df)

In [None]:
null_values_sum(fintech_df,'emp_title')


In [None]:
null_values_sum(fintech_df,'emp_length')

`int_rate`: Since the missing values are MCAR, simple imputation with the mean or median does not bias the estimate of the column's central value, although it does slightly reduce the column's variance. We use the mean here.
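
The effect of mean imputation under MCAR can be seen on synthetic data. The sketch below draws made-up interest rates, removes 10% of them at random, and fills the gaps with the observed mean; the distribution parameters are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic interest rates (arbitrary parameters, for illustration only)
rng = np.random.default_rng(0)
rates = pd.Series(rng.normal(loc=0.13, scale=0.04, size=1000))

# Remove 10% of the values completely at random (MCAR)
missing_positions = rng.choice(len(rates), size=100, replace=False)
observed = rates.copy()
observed.iloc[missing_positions] = np.nan

# Mean imputation
imputed = observed.fillna(observed.mean())

print(f'mean before: {observed.mean():.4f}, after: {imputed.mean():.4f}')
print(f'std  before: {observed.std():.4f}, after: {imputed.std():.4f}')
```

The mean is unchanged by construction, while the standard deviation shrinks because every imputed value sits exactly at the center of the distribution.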

In [None]:
mean = fintech_df['int_rate'].mean()
univariate_imputation(fintech_df, 'int_rate', mean)
lookup_df = update_lookup_df(lookup_df, 'int_rate', 'missing', mean)

In [None]:
null_values_sum(fintech_df,'int_rate')

`description`: The missing values are also MCAR, meaning they were omitted optionally by the borrowers. Since the description is not a critical feature for numeric processing, we can simply fill missing values with a placeholder such as 'No Description'.

In [None]:
univariate_imputation(fintech_df, 'description', 'No Description')
lookup_df = update_lookup_df(lookup_df, 'description', 'missing', 'No Description')

In [None]:
null_values_sum(fintech_df,'description')

## Observing outliers

In [None]:
def detect_outliers(df, col, method='Z-Score', threshold=3):
    """Return the rows of df whose value in col is an outlier under the chosen method."""
    if method == 'Z-Score':
        # Flag values more than `threshold` standard deviations from the mean
        z_scores = np.abs((df[col] - df[col].mean()) / df[col].std())
        outliers_mask = z_scores > threshold
    elif method == 'IQR':
        # Flag values outside the 1.5 * IQR fences
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
        print(f'Outliers below: {lower:.3f}')
        print(f'Outliers above: {upper:.3f}')
        outliers_mask = (df[col] < lower) | (df[col] > upper)
    else:
        raise ValueError(f"Unknown method: {method!r}. Use 'Z-Score' or 'IQR'.")

    outliers = df[outliers_mask]
    print(f'Percentage of Outliers: {len(outliers)/len(df)*100:.3f}%')

    return outliers

def plot_distribution(df,col):
    sns.histplot(df[col], kde=True)

In [None]:
numeric_cols = fintech_df.select_dtypes(include=['float64', 'int64']).columns
print(f"Numeric Columns: {numeric_cols}")


Let's start with `emp_length` column

In [None]:
plot_distribution(fintech_df,'emp_length')

The distribution of `emp_length` appears to be **right-skewed** based on the plot, with a long tail on the right side, especially toward the 10+ years range.

Given that the data is skewed and does not follow a normal distribution, IQR would be the more appropriate method for detecting outliers in this case. Z-score would not be suitable since it assumes a normal distribution, which is not true for this dataset.
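
To make the difference between the two rules concrete, the sketch below applies both to a synthetic right-skewed sample. The log-normal parameters are arbitrary assumptions and are not fitted to the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (arbitrary log-normal parameters)
rng = np.random.default_rng(42)
incomes = pd.Series(rng.lognormal(mean=11, sigma=0.6, size=5000))

# Z-score rule: more than 3 standard deviations from the mean
z_scores = (incomes - incomes.mean()) / incomes.std()
z_outliers = incomes[z_scores.abs() > 3]

# IQR rule: outside the 1.5 * IQR fences
q1, q3 = incomes.quantile([0.25, 0.75])
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = incomes[(incomes < lower_fence) | (incomes > upper_fence)]

print(f'Z-score outliers: {len(z_outliers)}')
print(f'IQR outliers: {len(iqr_outliers)}')
print(f'IQR fences: {lower_fence:.0f} to {upper_fence:.0f}')
```

On skewed data the long tail inflates the mean and standard deviation, which pushes the Z-score threshold outward, whereas the quartiles used by the IQR rule are much less affected. Note that the lower IQR fence can be negative even when every value is positive.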

In [None]:
detect_outliers(fintech_df, 'emp_length', method='IQR')

Based on the analysis using the IQR method for detecting outliers in `emp_length`, it was found that **0% of the data points were outliers**. This indicates that despite the skewness observed in the distribution, the values of `emp_length` do not fall outside the acceptable range for outliers using the IQR criterion.

---

Now let's observe `annual_inc` column

In [None]:
plot_distribution(fintech_df,'annual_inc')
plt.xlim(0, 800000)


The distribution of `annual_inc` appears to be **right-skewed** based on the plot. Given that the data is skewed and does not follow a normal distribution, IQR would be the more appropriate method for detecting outliers in this case. Z-score would not be suitable since it assumes a normal distribution, which is not true for this dataset.


In [None]:
outliers = detect_outliers(fintech_df, 'annual_inc', method='IQR')
len(outliers)

For the `annual_inc` column, the IQR fences are -25,000 and 167,000, and 5.18% of the data points fall outside them.

The values above 167,000 likely represent high-income earners whose annual income is well above the typical range of the dataset.
The lower fence of -25,000 is a computed threshold rather than an observed value: a negative annual income makes no practical sense, so any row below it would point to an erroneous entry.
This percentage suggests that a moderate portion of the data falls outside the expected range for annual income. Depending on the analysis goals, these outliers may need to be addressed to avoid skewing the results.
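
A quick way to see which side of the range the flagged values fall on is to count them separately. The sketch below uses a small made-up Series and a hypothetical helper, `count_outliers_by_side`; neither comes from the notebook.

```python
import pandas as pd

def count_outliers_by_side(values, k=1.5):
    """Count values below the lower IQR fence and above the upper one."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return {
        'lower_fence': lower,
        'upper_fence': upper,
        'below': int((values < lower).sum()),
        'above': int((values > upper).sum()),
    }

# Made-up incomes with one very large value
incomes = pd.Series([30_000, 45_000, 52_000, 60_000, 75_000, 90_000, 110_000, 400_000])
summary = count_outliers_by_side(incomes)
print(summary)
```

Here the lower fence is negative, so nothing can fall below it unless the data contains an invalid negative income; the single flagged value sits above the upper fence.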

----

Now let's observe `annual_inc_joint` column

In [None]:
plot_distribution(fintech_df,'annual_inc_joint')

The distribution of `annual_inc_joint` appears to be **right-skewed** based on the plot. Given that the data is skewed and does not follow a normal distribution, IQR would be the more appropriate method for detecting outliers in this case. Z-score would not be suitable since it assumes a normal distribution, which is not true for this dataset.


In [None]:
detect_outliers(fintech_df, 'annual_inc_joint', method='IQR')

For the `annual_inc_joint` column, the outliers detected below **0.000** and above **0.000** account for **6.85%** of the data. However, it's important to note that approximately **94%** of this column was originally null, and we imputed these missing values with **0** because it made sense to do so for non-Joint borrowers.

Therefore, the **6.85%** of data points flagged as outliers are not actually outliers in the true sense. Instead, they represent legitimate cases where the `annual_inc_joint` field was appropriately filled for borrowers who applied for Joint loans. These values are not problematic and should not be treated as outliers. This reinforces that the imputation decision was valid for this context.
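
The collapse of both fences to zero can be reproduced on a toy column. The values below are invented for illustration; only the proportion of zeros mimics the situation described above.

```python
import pandas as pd

# Hypothetical column: 94 zeros (non-joint loans) and 6 joint incomes
joint_income = pd.Series([0.0] * 94 + [80_000, 95_000, 120_000, 150_000, 60_000, 200_000])

q1, q3 = joint_income.quantile(0.25), joint_income.quantile(0.75)
iqr = q3 - q1  # both quartiles are 0, so the IQR is 0 as well
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

flagged = (joint_income < lower) | (joint_income > upper)
print(f'Fences: {lower} .. {upper}')
print(f'Flagged: {flagged.mean():.0%}, non-zero: {(joint_income != 0).mean():.0%}')
```

When more than three quarters of a column share one value, the IQR is zero and every other value is flagged, which is why the non-zero rows need to be examined on their own.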

Let's check whether there are any outliers among the values that are not equal to **0**

In [None]:
filtered_df = fintech_df[fintech_df['annual_inc_joint'] > 0]

plot_distribution(filtered_df,'annual_inc_joint')

The distribution of `annual_inc_joint` appears to be **right-skewed** based on the plot. Given that the data is skewed and does not follow a normal distribution, IQR would be the more appropriate method for detecting outliers in this case. Z-score would not be suitable since it assumes a normal distribution, which is not true for this dataset.


In [None]:
detect_outliers(filtered_df, 'annual_inc_joint', method='IQR')

The outliers for `annual_inc_joint` values greater than zero fall below -15,375 and above 265,225, representing 4.268% of the non-zero data. Given the nature of this field and its relation to joint loans, these outliers likely represent extreme or uncommon cases in borrower income, which could be considered for further transformation to reduce their impact.

----

Now let's observe `avg_cur_bal` column

In [None]:
plot_distribution(fintech_df,'avg_cur_bal')

The distribution of `avg_cur_bal` appears to be **right-skewed** based on the plot. Given that the data is skewed and does not follow a normal distribution, IQR would be the more appropriate method for detecting outliers in this case. Z-score would not be suitable since it assumes a normal distribution, which is not true for this dataset.


In [None]:
detect_outliers(fintech_df, 'avg_cur_bal', method='IQR')

For the `avg_cur_bal` column, we detected outliers as follows:

- **IQR fences at -20,667.75 and 42,820.25**: The negative lower fence is a by-product of the formula `Q1 - 1.5 * IQR` on right-skewed data, not evidence of negative balances. Any value actually below it may indicate an error in the data, as an average balance should not be negative to such a large extent.
- **5.516%** of the data is flagged as outliers, suggesting that a notable proportion of customers have average current balances far from those of the majority.

The high percentage of outliers could indicate that balances in this dataset are spread across a wide range, with certain customers having extremely high or low balances.

----

Now let's observe `tot_cur_bal` column

In [None]:
plot_distribution(fintech_df,'tot_cur_bal')

The distribution of `tot_cur_bal` appears to be **right-skewed** based on the plot. Given that the data is skewed and does not follow a normal distribution, IQR would be the more appropriate method for detecting outliers in this case. Z-score would not be suitable since it assumes a normal distribution, which is not true for this dataset.


In [None]:
detect_outliers(fintech_df, 'tot_cur_bal', method='IQR')

For the `tot_cur_bal` column, we detected outliers as follows:
- **3.43%** of the data is identified as outliers, which means a small but significant portion of the data has either very high or very low total balances.


---

Now let's observe `loan_amount` column

In [None]:
plot_distribution(fintech_df,'loan_amount')

The distribution of `loan_amount` appears to be **right-skewed** based on the plot. Given that the data is skewed and does not follow a normal distribution, IQR would be the more appropriate method for detecting outliers in this case. Z-score would not be suitable since it assumes a normal distribution, which is not true for this dataset.


In [None]:
detect_outliers(fintech_df, 'loan_amount', method='IQR')

For the `loan_amount` column, we detected the following outliers:

- **Outliers below -10,890.63** and above **39,484.38**: These thresholds mean that any loan amount outside this range is considered an outlier. Negative loan amounts are not plausible in real-world scenarios, so any value below the lower threshold would indicate a data error.

- **2.44%** of the data is flagged as outliers, which is a relatively small percentage. However, given the nature of loan data, it is important to carefully review any loans outside this range, particularly the negative values, as they may distort any financial analysis or modeling.

This percentage is manageable, but it's critical to address these outliers to ensure the accuracy and integrity of the dataset for further analysis.

---

Now let's observe `funded_amount` column

In [None]:
plot_distribution(fintech_df,'funded_amount')

The distribution of `funded_amount` appears to be **right-skewed** based on the plot. Given that the data is skewed and does not follow a normal distribution, IQR would be the more appropriate method for detecting outliers in this case. Z-score would not be suitable since it assumes a normal distribution, which is not true for this dataset.


In [None]:
detect_outliers(fintech_df, 'funded_amount', method='IQR')

For the `funded_amount` column, we detected the following outliers:

- **Outliers below -10,890.63** and above **39,484.38**: These thresholds mean that any funded amount outside this range is considered an outlier. Negative funded amounts are not plausible in real-world scenarios, so any value below the lower threshold would indicate a data error.

- **2.44%** of the data is flagged as outliers, which is a relatively small percentage. However, given the nature of loan data, it is important to carefully review any loans outside this range, particularly the negative values, as they may distort any financial analysis or modeling.

This percentage is manageable, but it's critical to address these outliers to ensure the accuracy and integrity of the dataset for further analysis.

---

Now let's observe `int_rate` column

In [None]:
plot_distribution(fintech_df,'int_rate')

The distribution of `int_rate` appears to be **right-skewed** based on the plot. Given that the data is skewed and does not follow a normal distribution, IQR would be the more appropriate method for detecting outliers in this case. Z-score would not be suitable since it assumes a normal distribution, which is not true for this dataset.


In [None]:
detect_outliers(fintech_df, 'int_rate', method='IQR')

For the `int_rate` column, the following outliers were detected:

- **Outliers below 0.00% and above 25.3%**: This indicates that any interest rates exceeding 25.3% are considered outliers. Since 0% interest rates are unlikely in most lending situations, any value at or below 0% is also flagged, possibly indicating erroneous entries or special cases.

- **2.29%** of the interest rate data is flagged as outliers. This is a small proportion of the overall data, which suggests that the majority of interest rates fall within a reasonable range, but these outliers could represent either high-risk loans, special loan agreements, or errors.

It’s essential to review these outliers to verify if they are valid cases or if any corrections need to be made, especially for values near 0%, which may skew analysis if left unaddressed.

## Handling outliers

Handling outliers in the `annual_inc`

Log transformation is a common technique used to manage outliers, especially in datasets that are positively skewed. Therefore we will apply the log transform on the `annual_inc` column
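
As a small illustration of why the log transform helps with positive skew, the sketch below measures skewness before and after the transform on synthetic data. The lognormal sample and the seed are assumptions for demonstration; the last lines also show how `np.log` and `np.log1p` differ on a zero value.

```python
import numpy as np
import pandas as pd

# Synthetic positively skewed data (hypothetical stand-in for an income column)
rng = np.random.default_rng(42)
income = pd.Series(rng.lognormal(mean=11, sigma=0.7, size=2000))

skew_before = income.skew()
skew_after = np.log(income).skew()
print(f'Skewness before: {skew_before:.2f}, after log: {skew_after:.2f}')

# np.log is only defined for strictly positive values: log(0) is -inf,
# whereas np.log1p(0) is 0 because it computes log(1 + x)
with np.errstate(divide='ignore'):
    log_of_zero = np.log(0.0)
log1p_of_zero = np.log1p(0.0)
print(log_of_zero, log1p_of_zero)
```

If a column can contain zeros, `np.log1p` is the safer variant, at the cost of needing `np.expm1` rather than `np.exp` to reverse it.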

In [None]:
# Generic function for log transformation
def get_log_transformation(df, column):
    # Natural log; values must be strictly positive (np.log(0) is -inf)
    return np.log(df[column])


In [None]:
def compare_distributions(df, col, new_col):
    fig, ax = plt.subplots(1, 2, figsize=(10, 5))

    sns.histplot(df[col],ax=ax[0], kde=True);
    ax[0].set_title('Original Data');

    sns.histplot(new_col, ax=ax[1], kde=True);
    ax[1].set_title('Outliers Handled Data');

In [None]:
log_annual_inc = get_log_transformation(fintech_df, 'annual_inc')
compare_distributions(fintech_df, 'annual_inc', log_annual_inc)

In [None]:
def apply_transformation(df, col, transformed_col, ignore_zero=False):
    if ignore_zero:
        df.loc[df[col] != 0, col] = transformed_col
    else:
        df[col] = transformed_col
    return df

In [None]:
apply_transformation(fintech_df, 'annual_inc', log_annual_inc)
outliers = detect_outliers(fintech_df, 'annual_inc', method='IQR')
len(outliers)


After applying the log transformation to `annual_inc`, the distribution became more normal, reducing outliers from **5.18%** to **1.95%**. This shows the transformation effectively addressed skewness and minimized extreme values.

----

Handling outliers in the `annual_inc_joint` values > 0

We will apply the log transform as it is positively skewed

In [None]:
filtered_df = fintech_df[fintech_df['annual_inc_joint'] != 0]
log_annual_inc_joint = get_log_transformation(filtered_df, 'annual_inc_joint')
compare_distributions(filtered_df, 'annual_inc_joint', log_annual_inc_joint)

In [None]:
fintech_df = apply_transformation(fintech_df, 'annual_inc_joint', log_annual_inc_joint, ignore_zero=True)
outliers = detect_outliers(fintech_df[fintech_df['annual_inc_joint'] != 0], 'annual_inc_joint', method='IQR')
len(outliers)



After applying the log transformation to `annual_inc_joint` values > 0 and handling the outliers, the percentage of outliers has decreased to 1.405%, with outliers occurring below 10.540 and above 12.848. This shows a significant improvement in reducing the impact of extreme values.

---

Handling outliers in the `avg_cur_bal`

We will apply the log transform on the `avg_cur_bal` column as it is positively skewed

In [None]:
log_avg_cur_bal = get_log_transformation(fintech_df, 'avg_cur_bal')
compare_distributions(fintech_df, 'avg_cur_bal', log_avg_cur_bal)

In [None]:
fintech_df = apply_transformation(fintech_df, 'avg_cur_bal', log_avg_cur_bal)
outliers = detect_outliers(fintech_df, 'avg_cur_bal', method='IQR')
len(outliers)


After applying the log transformation to `avg_cur_bal`, the outliers decreased significantly from **5.52%** to **0.31%**, and the distribution became much closer to normal. This demonstrates the transformation's effectiveness in reducing skewness and outliers.

---

Handling outliers in the `tot_cur_bal`

We will apply the log transform on the `tot_cur_bal` column as it is positively skewed

In [None]:
log_tot_cur_bal = get_log_transformation(fintech_df, 'tot_cur_bal')
compare_distributions(fintech_df, 'tot_cur_bal', log_tot_cur_bal)

In [None]:
fintech_df = apply_transformation(fintech_df, 'tot_cur_bal', log_tot_cur_bal)
outliers = detect_outliers(fintech_df, 'tot_cur_bal', method='IQR')
len(outliers)


After applying the log transformation to `tot_cur_bal`, the outliers decreased significantly from **3.43%** to **0.425%**, and the distribution became much closer to normal. This demonstrates the transformation's effectiveness in reducing skewness and outliers.

---

Handling outliers in the `loan_amount`

We will apply the log transform on the `loan_amount` column as it is positively skewed

In [None]:
log_loan_amount = get_log_transformation(fintech_df, 'loan_amount')
compare_distributions(fintech_df, 'loan_amount', log_loan_amount)

Although the log transformation decreased the number of outliers for the `loan_amount`, it increased the left skewness of the distribution. Therefore, we will use the capping method to handle the outliers more effectively without distorting the distribution.

In [None]:
def cap_outliers(df, column):
    # Calculate Q1 (25th percentile) and Q3 (75th percentile)
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    
    # Calculate IQR (Interquartile Range)
    IQR = Q3 - Q1
    
    # Calculate the lower and upper bounds for capping
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    print(f'Lower Bound: {lower_bound:.3f}')
    print(f'Upper Bound: {upper_bound:.3f}')
    
    cap_column = np.where(df[column] < lower_bound, lower_bound, 
                          np.where(df[column] > upper_bound, upper_bound, df[column]))
    
    return cap_column


In [None]:
cap_loan_amount = cap_outliers(fintech_df, 'loan_amount')
compare_distributions(fintech_df, 'loan_amount', cap_loan_amount)
fintech_df['loan_amount'].describe()

In [None]:
fintech_df = apply_transformation(fintech_df, 'loan_amount', cap_loan_amount)
outliers = detect_outliers(fintech_df, 'loan_amount', method='IQR')
len(outliers)


The `loan_amount` column initially contained outliers that were either below -10,890.625 or above 39,484.375, with 2.438% of the data being considered outliers. After applying the capping method, these values were clipped to the bounds rather than dropped, bringing the percentage of outliers down to 0% while keeping every row. Additionally, the distribution after capping remains similar to the original, as the extreme values were handled without drastically altering the overall shape of the data distribution.
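
The same capping can be written with pandas' `Series.clip`. The sketch below is an equivalent formulation on made-up amounts, not the function used in this notebook; `cap_with_clip` is a hypothetical name.

```python
import pandas as pd

def cap_with_clip(values, k=1.5):
    """Clip a numeric Series to its IQR fences and return the capped Series."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Made-up loan amounts with one extreme value
amounts = pd.Series([1_000, 5_000, 8_000, 10_000, 12_000, 15_000, 20_000, 90_000], dtype=float)
capped = cap_with_clip(amounts)

# Re-run the IQR rule on the capped values
q1, q3 = capped.quantile(0.25), capped.quantile(0.75)
iqr = q3 - q1
remaining = int(((capped < q1 - 1.5 * iqr) | (capped > q3 + 1.5 * iqr)).sum())
print(capped.tolist(), remaining)
```

Capping leaves the quartiles unchanged, so the fences stay where they were and no capped value can lie strictly outside them.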

---

Handling outliers in the `funded_amount`


The `funded_amount` column had the exact same outlier range and IQR limits as the loan amount and a similar percentage of outliers. Therefore, we will apply the same capping method to the funded amount column to handle these outliers effectively.

In [None]:
cap_funded_amount = cap_outliers(fintech_df, 'funded_amount')
compare_distributions(fintech_df, 'funded_amount', cap_funded_amount)

In [None]:
fintech_df = apply_transformation(fintech_df, 'funded_amount', cap_funded_amount)
outliers = detect_outliers(fintech_df, 'funded_amount', method='IQR')
len(outliers)


After applying the capping method to the `funded_amount`, the percentage of outliers is 0%.
This means that, just like the `loan_amount`, all outliers were successfully handled by the capping method, and there are now no outliers remaining in the `funded_amount` column. This ensures that extreme values won't distort further analysis.

---

Handling outliers in the `int_rate`

We will apply the log transform on the `int_rate` column as it is positively skewed


In [None]:
log_int_rate = get_log_transformation(fintech_df, 'int_rate')
compare_distributions(fintech_df, 'int_rate', log_int_rate)

In [None]:
fintech_df = apply_transformation(fintech_df, 'int_rate', log_int_rate)
outliers = detect_outliers(fintech_df, 'int_rate', method='IQR')
len(outliers)


The log transformation for `int_rate` not only shifted the distribution closer to normality but also eliminated all outliers from the `int_rate` column. This significantly reduced the percentage of outliers from 2.294% to 0%, improving the stability of the data for further analysis.

# 4 - Data transformation and feature eng.

## 4.1 - Adding Columns

1. Month Number Column

In [None]:
def add_month_number(df, date_column):
    df[date_column] = pd.to_datetime(df[date_column])
    df['month_number'] = df[date_column].dt.month
    return df

In [None]:
fintech_df = add_month_number(fintech_df, 'issue_date')
fintech_df.head()


2. Salary Can Cover Loan

In [None]:
def add_salary_can_cover(df, log_annual_income_column, loan_amount_column):
    # Reverse the log transformation of annual income
    df['salary_can_cover'] = (np.exp(df[log_annual_income_column]) >= df[loan_amount_column]).astype(int)
    return df



In [None]:
fintech_df = add_salary_can_cover(fintech_df, 'annual_inc', 'loan_amount')
fintech_df.head()

3. Letter Grade

In [None]:
def map_grade(grade):
    if 1 <= grade <= 5:
        return 'A'
    elif 6 <= grade <= 10:
        return 'B'
    elif 11 <= grade <= 15:
        return 'C'
    elif 16 <= grade <= 20:
        return 'D'
    elif 21 <= grade <= 25:
        return 'E'
    elif 26 <= grade <= 30:
        return 'F'
    elif 31 <= grade <= 35:
        return 'G'
    else:
        return 'Unknown'  # In case there are grades outside the expected range

def update_lookup_with_grades(lookup_df):    
    for i in range(1, 36):
        letter = map_grade(i)
        lookup_df = pd.concat([lookup_df, pd.DataFrame([{'column': 'grade', 'original': str(i), 'imputed': letter}])], ignore_index=True)
    return lookup_df

lookup_df = update_lookup_with_grades(lookup_df)

In [None]:
def add_letter_grade(df, grade_column):
    df['letter_grade'] = df[grade_column].apply(map_grade)
    return df

In [None]:
fintech_df = add_letter_grade(fintech_df, 'grade')
fintech_df.head()


4. Installment per Month Calculation

In [None]:
def calculate_monthly_installment(df, loan_amount_column, log_int_rate_column, term_column):
    df_copy = df.copy()
    # Convert term to months (e.g., '36 months' -> 36); the raw string avoids an invalid '\d' escape
    df_copy[term_column] = df_copy[term_column].str.extract(r'(\d+)', expand=False).astype(int)
    
    # Calculate monthly installment directly in the apply function without adding intermediary columns
    df['installment_per_month'] = df_copy.apply(
        lambda row: (row[loan_amount_column] * (np.exp(row[log_int_rate_column]) / 12) * (1 + (np.exp(row[log_int_rate_column]) / 12)) ** row[term_column]) / 
                    ((1 + (np.exp(row[log_int_rate_column]) / 12)) ** row[term_column] - 1)
        if np.exp(row[log_int_rate_column]) > 0 else row[loan_amount_column] / row[term_column], axis=1
    )
    
    return df


In [None]:
fintech_df = calculate_monthly_installment(fintech_df, 'loan_amount', 'int_rate', 'term')
fintech_df.head()
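
The cell above applies the standard amortization formula, `M = P * r * (1 + r)^n / ((1 + r)^n - 1)`, row by row, with `r` the monthly rate and `n` the number of months. A vectorized sketch of the same formula follows. It assumes the annual rate is already a decimal fraction on its original scale (for example 0.12, not its log), and `monthly_installment` is a hypothetical helper name.

```python
import numpy as np

def monthly_installment(principal, annual_rate, n_months):
    """Amortized monthly payment; falls back to principal / n when the rate is 0."""
    principal = np.asarray(principal, dtype=float)
    monthly_rate = np.asarray(annual_rate, dtype=float) / 12
    n = np.asarray(n_months, dtype=float)

    growth = (1 + monthly_rate) ** n
    # A zero rate gives 0 / 0 here; that entry is replaced by np.where below
    with np.errstate(divide='ignore', invalid='ignore'):
        amortized = principal * monthly_rate * growth / (growth - 1)
    return np.where(monthly_rate > 0, amortized, principal / n)

# Two made-up loans: 12% annual rate over 36 months, and an interest-free loan
payments = monthly_installment([10_000, 12_000], [0.12, 0.0], [36, 36])
print(payments.round(2))
```

Working on whole arrays avoids the per-row `apply`, which can matter on large frames.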

## 4.2 - Encoding

In [None]:

def label_encode_column(df, column, new_column):
    le = LabelEncoder()
    df[new_column] = le.fit_transform(df[column])
    return df


In [None]:
def one_hot_encode_columns(df, columns):
    for column in columns:
        one_hot_encoded = pd.get_dummies(df[column], prefix=column)
        # Convert boolean values to integers (0 and 1)
        one_hot_encoded = one_hot_encoded.astype(int)
        # Concatenate the one-hot encoded columns to the original dataframe
        df = pd.concat([df, one_hot_encoded], axis=1)
        
        df.drop(columns=column, inplace=True)
    return df


In [None]:
fintech_df_encoded = fintech_df.copy()
fintech_df_encoded = label_encode_column(fintech_df_encoded, 'letter_grade','letter_grade_encoded')
lookup_df = add_lookup_values(lookup_df, 'letter_grade', fintech_df_encoded['letter_grade'], fintech_df_encoded['letter_grade_encoded'])
fintech_df_encoded.head()

We used Label Encoding for `letter_grade` because it is an ordinal feature where the order matters. The grades (A-G) follow a sequence where A > B > C, etc., and this relationship needs to be preserved in the encoding. `LabelEncoder` assigns codes in sorted (alphabetical) order, which coincides with the grade order here.

After Label Encoding, the `letter_grade_encoded` column contains integer values representing the grades (e.g., A -> 0, B -> 1, etc.).
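
When the order should be stated explicitly rather than left to alphabetical sorting, an ordered categorical is one alternative. This is a sketch of a different technique from the `LabelEncoder` call used above, shown on made-up values.

```python
import pandas as pd

# Explicit grade order, best to worst
grade_order = ['A', 'B', 'C', 'D', 'E', 'F', 'G']

letters = pd.Series(['C', 'A', 'G', 'B'])
codes = pd.Categorical(letters, categories=grade_order, ordered=True).codes
print(list(codes))

# Labels outside the declared categories receive the code -1
unknown_codes = pd.Categorical(['Unknown'], categories=grade_order, ordered=True).codes
print(list(unknown_codes))
```

With this approach, a value such as `'Unknown'` stands out as `-1` instead of being given the next integer after `G`.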

In [None]:
fintech_df_encoded = label_encode_column(fintech_df_encoded, 'addr_state', 'addr_state_encoded')
fintech_df_encoded = label_encode_column(fintech_df_encoded, 'state','state_encoded')
fintech_df_encoded = label_encode_column(fintech_df_encoded, 'purpose','purpose_encoded')

lookup_df = add_lookup_values(lookup_df, 'addr_state', fintech_df_encoded['addr_state'], fintech_df_encoded['addr_state_encoded'])
lookup_df = add_lookup_values(lookup_df, 'state', fintech_df_encoded['state'], fintech_df_encoded['state_encoded'])
lookup_df = add_lookup_values(lookup_df, 'purpose', fintech_df_encoded['purpose'], fintech_df_encoded['purpose_encoded'])

fintech_df_encoded.head()

We used Label Encoding for `addr_state`, `state`, and `purpose` because they have a lot of unique values, and one-hot encoding would add a lot of columns.

After Label Encoding, the `addr_state_encoded`, `state_encoded`, and `purpose_encoded` columns contain integer codes representing the original values.

In [None]:
def label_encode_loan_status(df):
    mapping = {'Fully Paid':1,'Current': 2, 'In Grace Period': 3, 'Late (16-30 days)': 4, 'Late (31-120 days)': 5, 'Default': 6, 'Charged Off': 7}
    df['loan_status_encoded'] = df['loan_status'].map(mapping)

    return df

fintech_df_encoded = label_encode_loan_status(fintech_df_encoded)
lookup_df = add_lookup_values(lookup_df, 'loan_status', fintech_df_encoded['loan_status'], fintech_df_encoded['loan_status_encoded'])

fintech_df_encoded.head()

In [None]:
def label_encode_verification_status(df):
    mapping = {'Not Verified': 1, 'Verified': 2, 'Source Verified': 3}
    df['verification_status_encoded'] = df['verification_status'].map(mapping)    
    return df

fintech_df_encoded = label_encode_verification_status(fintech_df_encoded)
lookup_df = add_lookup_values(lookup_df, 'verification_status', fintech_df_encoded['verification_status'], fintech_df_encoded['verification_status_encoded'])

fintech_df_encoded.head()


We used a manual ordinal mapping for `verification_status` and `loan_status` because they are ordinal features where the order matters, and this relationship needs to be preserved in the encoding.

After encoding, the `verification_status_encoded` and `loan_status_encoded` columns contain integer values representing the statuses.
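
A dictionary passed to `Series.map` silently turns any label missing from the dictionary into `NaN`. The sketch below shows one way to surface such labels; `map_with_report`, the sample statuses, and the shortened mapping are hypothetical.

```python
import pandas as pd

def map_with_report(series, mapping):
    """Map labels to codes and list the labels that had no entry in the mapping."""
    encoded = series.map(mapping)
    unmapped = sorted(series[encoded.isna() & series.notna()].unique())
    return encoded, unmapped

statuses = pd.Series(['Fully Paid', 'Current', 'Charged Off', 'Does not exist'])
mapping = {'Fully Paid': 1, 'Current': 2, 'Charged Off': 7}

encoded, unmapped = map_with_report(statuses, mapping)
print(encoded.tolist())
print(unmapped)
```

Running a check like this right after the mapping makes it clear whether every status in the data was covered.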


In [None]:
fintech_df_encoded = one_hot_encode_columns(fintech_df_encoded, ['home_ownership', 'term', 'type'])

fintech_df_encoded.head()

We used One-Hot Encoding for `home_ownership`, `term`, `type` because these are nominal features where there is no inherent order among the categories. One-hot encoding allows us to represent these categories without introducing any false ordinal relationship.

After One-Hot Encoding, the dataset now has binary columns representing each category in `home_ownership`, `term`, `type`. For example, home_ownership_RENT would be 1 if the home ownership status is "RENT" and 0 otherwise.
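
For reference, `pd.get_dummies` can also encode several columns in a single call through its `columns` and `dtype` parameters. The toy frame below is invented for illustration and only mirrors the column names used above.

```python
import pandas as pd

toy = pd.DataFrame({
    'home_ownership': ['RENT', 'OWN', 'RENT'],
    'type': ['Individual', 'Joint', 'Individual'],
    'loan_amount': [1000, 2000, 3000],
})

# The listed columns are replaced by 0/1 indicator columns; others are kept as-is
encoded = pd.get_dummies(toy, columns=['home_ownership', 'type'], dtype=int)
print(encoded)
```

The original categorical columns are dropped automatically, so no separate `drop` step is needed.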

## 4.3 - Normalization 

We have already handled the outliers for most of the numerical columns using log transformations, except for the `loan_amount` and `funded_amount` columns. Therefore, we will now proceed to normalize these two columns.

In [None]:
normalized_loan = get_log_transformation(fintech_df_encoded, 'loan_amount')
compare_distributions(fintech_df_encoded, 'loan_amount', normalized_loan)

As observed, the log transformation did not bring the distribution of `loan_amount` closer to normal, but this is the best we can do right now.

In [None]:
normalized_funded = get_log_transformation(fintech_df_encoded, 'funded_amount')
compare_distributions(fintech_df_encoded, 'funded_amount', normalized_funded)

Again, as observed, the log transformation did not bring the distribution of `funded_amount` closer to normal, but this is the best we can do right now.
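
If "normalization" is read as rescaling to a fixed range rather than reshaping the distribution, min-max scaling is a common option. The sketch below is a different technique from the log transform tried above, shown on made-up amounts; `min_max_scale` is a hypothetical helper.

```python
import pandas as pd

def min_max_scale(values):
    """Rescale a numeric Series linearly to the range [0, 1]."""
    span = values.max() - values.min()
    if span == 0:
        # A constant column has no spread; return zeros instead of dividing by 0
        return values * 0.0
    return (values - values.min()) / span

loan_amounts = pd.Series([1_000.0, 10_000.0, 25_000.0, 40_000.0])
scaled = min_max_scale(loan_amounts)
print(scaled.round(3).tolist())
```

This changes only the scale of the column, not its shape, so any skewness in the original values remains.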

# 5 - Lookup Table(s)

In [None]:
lookup_df.to_csv('lookup_table.csv', index=False)
# Check the lookup table content
lookup_df.head()

# 6 - Bonus ( Data Integration )

In [None]:
def fetch_and_map_state_names(df, state_column):
  url = "https://www23.statcan.gc.ca/imdb/p3VD.pl?Function=getVD&TVD=53971"
  response = requests.get(url)
  soup = BeautifulSoup(response.text, 'html.parser')
  table = soup.find('table')

  # <AlphaCode, StateName>
  state_dict = {}

  for row in table.find_all('tr')[1:]:
    columns = row.find_all('td')
    
    if len(columns) >= 3:  # columns[2] is read below, so at least three cells are needed
      alpha_code = columns[2].text.strip()
      state_name = columns[0].text.strip()  
      
      state_dict[alpha_code] = state_name

  df['state_name'] = df[state_column].map(state_dict)
  return df

fintech_df_encoded = fetch_and_map_state_names(fintech_df_encoded, 'state')
fintech_df_encoded.head()


In [None]:

fintech_df_encoded[['state', 'state_name']].head()

In [None]:
fintech_df_encoded.drop('state', axis=1, inplace=True)
fintech_df_encoded.head()


# 7 - Exporting the dataframe to a csv file or parquet

In [None]:
fintech_df_encoded.to_parquet('./fintech_data_MET_P2_52_0812_clean.parquet')



In [None]:
fintech_df_encoded.head()

In [None]:
# Ensure the correct df saved
df = pd.read_parquet('./fintech_data_MET_P2_52_0812_clean.parquet')
df.head()