## Feedforward NN on Test Dataset

This notebook gets the basic feedforward NN implementation working on a subset of the dataset. The output should be one action_taken label from the options below:

| Code | Description                                      |
|:----:|:-------------------------------------------------|
| 1    | Loan originated                                  |
| 2    | Application approved but not accepted            |
| 3    | Application denied                               |
| 4    | Application withdrawn by applicant               |
| 5    | File closed for incompleteness                   |
| 6    | Purchased loan                                   |
| 7    | Preapproval request denied                       |
| 8    | Preapproval request approved but not accepted    |
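
For readable reports later on, the codes in the table above can be kept in a small lookup. This is a minimal sketch; the names `ACTION_TAKEN_LABELS` and `describe_action` are illustrative and not part of the project code.

```python
# Lookup for the action_taken codes listed in the table above.
ACTION_TAKEN_LABELS = {
    1: "Loan originated",
    2: "Application approved but not accepted",
    3: "Application denied",
    4: "Application withdrawn by applicant",
    5: "File closed for incompleteness",
    6: "Purchased loan",
    7: "Preapproval request denied",
    8: "Preapproval request approved but not accepted",
}


def describe_action(code: int) -> str:
    """Return the description for an action_taken code, with a fallback for unknown codes."""
    return ACTION_TAKEN_LABELS.get(code, f"Unknown code {code}")


print(describe_action(3))
```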

In [1]:
import numpy as np
from models.feedforward import FeedForwardModel, TabularDataset
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [2]:
# Load data with low_memory=False to handle mixed types properly
df = pd.read_csv("datasets/500ktest_dataset_lar.csv", low_memory=False)
print(f"Data shape: {df.shape}")
print(f"\nData types:\n{df.dtypes.value_counts()}")
df.info()

Data shape: (499889, 74)

Data types:
int64      41
object     24
float64     9
Name: count, dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 499889 entries, 0 to 499888
Data columns (total 74 columns):
 #   Column                                    Non-Null Count   Dtype  
---  ------                                    --------------   -----  
 0   Unnamed: 0                                499889 non-null  int64  
 1   activity_year                             499889 non-null  int64  
 2   lei                                       499889 non-null  object 
 3   derived_msa_md                            499889 non-null  int64  
 4   state_code                                491047 non-null  object 
 5   county_code                               487609 non-null  float64
 6   census_tract                              485422 non-null  float64
 7   conforming_loan_limit                     498060 non-null  object 
 8   derived_loan_product_type                 499889 non-null 

## Preprocessing

In [3]:
# Check target distribution
target_col = "action_taken"
print(f"Target distribution:\n{df[target_col].value_counts().sort_index()}")
print(f"\nTarget unique values: {sorted(df[target_col].unique())}")

Target distribution:
action_taken
1    252377
2     14605
3     85576
4     62995
5     23941
6     52115
7      1963
8      6317
Name: count, dtype: int64

Target unique values: [np.int64(1), np.int64(2), np.int64(3), np.int64(4), np.int64(5), np.int64(6), np.int64(7), np.int64(8)]
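
The distribution above is strongly imbalanced: class 1 covers about half of the rows while class 7 has fewer than 2,000. One common way to quantify this is inverse-frequency ("balanced") class weights. The sketch below only computes them from the counts printed above. Whether `FeedForwardModel` can accept such weights is not shown in this notebook, so that part is left open.

```python
# Class counts copied from the value_counts() output above.
class_counts = {1: 252377, 2: 14605, 3: 85576, 4: 62995,
                5: 23941, 6: 52115, 7: 1963, 8: 6317}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# "Balanced" heuristic: weight_c = n_samples / (n_classes * count_c).
class_weights = {c: n_samples / (n_classes * n) for c, n in class_counts.items()}

for c in sorted(class_counts):
    share = class_counts[c] / n_samples
    print(f"class {c}: share={share:.3f}, weight={class_weights[c]:.2f}")
```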


#### Drop Unwanted Cols & Run Diagnostics

In [4]:
# --- Drop ID and leakage columns safely ---

# Columns to drop (identifiers or data leakage sources)
cols_to_drop = [
    "Unnamed: 0",           # Index column
    "denial_reason_1",      # Leakage - only exists for denied applications
    "state_code",
    "county_code",
    "census_tract",
    "lei",
    "derived_msa_md",
    "state_code",           # Duplicate of the entry above; it is counted twice in the "Dropped" diagnostic
    "activity_year",
    "interest_rate",        # Leakage - only exists for accepted applications
    "total_loan_costs",     # Leakage - isn't known at the time of application
    "origination_charges",  # Leakage - isn't known at the time of application
    "discount_points",      # Leakage - isn't known at the time of application
    "lender_credits",       # Leakage - isn't known at the time of application
    "intro_rate_period"     # Potential leakage - depends on the loan type (may be proposed or determined later)
]

# Tract-level census columns (optional to include later)
tract_cols = [
    "tract_population",
    "tract_minority_population_percent",
    "ffiec_msa_md_median_family_income",
    "tract_to_msa_income_percentage",
    "tract_owner_occupied_units",
    "tract_one_to_four_family_homes",
    "tract_median_age_of_housing_units"
]

# Combine and keep only columns that actually exist in df
drop_candidates = cols_to_drop + tract_cols
existing_to_drop = [col for col in drop_candidates if col in df.columns]

# Drop
df_preprocessed = df.drop(columns=existing_to_drop, errors='ignore')

# --- Diagnostics ---
print(f"✅ Dropped {len(existing_to_drop)} columns ({len(drop_candidates) - len(existing_to_drop)} not found)")
print(f"Data shape after dropping: {df_preprocessed.shape}")
print(f"Remaining columns: {len(df_preprocessed.columns)}")

# Show object columns (categorical/text)
object_cols = df_preprocessed.select_dtypes(include=['object']).columns.tolist()
print(f"\n🧩 Object columns ({len(object_cols)}): {object_cols}")

# Count missing values
missing = df_preprocessed.isnull().sum()
# Count "Exempt" values
exempt_counts = (df_preprocessed == "Exempt").sum()
# Combine into a single DataFrame for display
missing_summary = pd.DataFrame({
    'missing_values': missing,
    'exempt_values': exempt_counts
})
# Filter to columns that have either missing or exempt values
missing_summary = missing_summary[(missing_summary['missing_values'] > 0) | 
                                  (missing_summary['exempt_values'] > 0)]
# Display
if missing_summary.empty:
    print("✅ No missing or 'Exempt' values detected.")
else:
    print(f"\n⚠️ Columns with missing or 'Exempt' values ({len(missing_summary)}):\n")
    print(missing_summary.sort_values(by=['missing_values', 'exempt_values'], ascending=False))



✅ Dropped 22 columns (0 not found)
Data shape after dropping: (499889, 53)
Remaining columns: 53

🧩 Object columns (16): ['conforming_loan_limit', 'derived_loan_product_type', 'derived_dwelling_category', 'derived_ethnicity', 'derived_race', 'derived_sex', 'combined_loan_to_value_ratio', 'rate_spread', 'loan_term', 'property_value', 'total_units', 'debt_to_income_ratio', 'applicant_age', 'co_applicant_age', 'applicant_age_above_62', 'co_applicant_age_above_62']

⚠️ Columns with missing or 'Exempt' values (13):

                              missing_values  exempt_values
co_applicant_age_above_62             319722              0
rate_spread                           250978          11624
debt_to_income_ratio                  166317          11599
combined_loan_to_value_ratio          165036          11612
property_value                        100779          11677
income                                 72816              0
applicant_age_above_62                 59837          

#### Fix Missing Values

In [5]:
import matplotlib.pyplot as plt

# Count NA and string ("Exempt") values per column
def count_na_exempt_per_action(df, action_col='action_taken'):
    # For each column, count NAs and "Exempt"s, grouped by action_taken
    result = {}
    for col in df.columns:
        if col == action_col:
            continue
        # NA count per action
        na_counts = df.groupby(action_col)[col].apply(lambda x: x.isna().sum())
        # "Exempt" count per action (only for object type cols)
        if df[col].dtype == object:
            exempt_counts = df.groupby(action_col)[col].apply(lambda x: (x == "Exempt").sum())
        else:
            exempt_counts = pd.Series(0, index=na_counts.index)
        result[col] = pd.DataFrame({'NA': na_counts, 'Exempt': exempt_counts})
    return result

# Visualize missing/"Exempt" counts for each column per action_taken
def plot_na_exempt_summary_na_cols(result_dict, top_n=6):
    # Focus on columns with any nonzero NA or "Exempt"
    for col, data in result_dict.items():
        if data['NA'].sum() > 0 or data['Exempt'].sum() > 0:
            data[['NA', 'Exempt']].plot(kind='bar', stacked=True, title=f"{col} - missing & 'Exempt' per action_taken")
            plt.xlabel('action_taken')
            plt.ylabel('Count')
            plt.tight_layout()
            plt.show()

# --- Run the summary ---
na_exempt_summary = count_na_exempt_per_action(df_preprocessed, action_col='action_taken')

# Build a combined summary DataFrame
all_actions = []
for col, d in na_exempt_summary.items():
    # Add a column for the variable to make unstacking later easier
    df_tmp = d.copy()
    df_tmp["variable"] = col
    all_actions.append(df_tmp.reset_index())
summary_df = pd.concat(all_actions, ignore_index=True)
# Reorder for easier inspection
summary_df = summary_df[["variable", "action_taken", "NA", "Exempt"]]

# Pivot for better at-a-glance viewing: index: variable, columns: action_taken, values: NA/Exempt
summary_na = summary_df.pivot_table(index="variable", columns="action_taken", values="NA", aggfunc='sum').fillna(0).astype(int)
summary_exempt = summary_df.pivot_table(index="variable", columns="action_taken", values="Exempt", aggfunc='sum').fillna(0).astype(int)

# Join NA and Exempt for each variable
summary_all = summary_na.add_suffix(" NA").join(summary_exempt.add_suffix(" Exempt"))

# Optionally sort rows by total NA+Exempt counts
summary_all["Total Missing"] = summary_na.sum(axis=1) + summary_exempt.sum(axis=1)
summary_all = summary_all.sort_values("Total Missing", ascending=False)

# Show the top rows (all at once)
display_cols = [col for col in summary_all.columns if col != "Total Missing"]
print("Summary of missing (NA) and 'Exempt' counts per variable and per action_taken:")
display(summary_all[display_cols].head(15))  # Show top 15 by default; change as needed

Summary of missing (NA) and 'Exempt' counts per variable and per action_taken:


| variable | 1 NA | 2 NA | 3 NA | 4 NA | 5 NA | 6 NA | 7 NA | 8 NA | 1 Exempt | 2 Exempt | 3 Exempt | 4 Exempt | 5 Exempt | 6 Exempt | 7 Exempt | 8 Exempt |
|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| co_applicant_age_above_62 | 141187 | 9432 | 58272 | 39053 | 16175 | 50669 | 1264 | 3670 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| rate_spread | 22880 | 2200 | 84294 | 62031 | 23804 | 51810 | 1912 | 2047 | 8406 | 369 | 1282 | 964 | 137 | 305 | 51 | 110 |
| debt_to_income_ratio | 17383 | 1960 | 9022 | 62027 | 23804 | 51903 | 84 | 134 | 8455 | 368 | 1299 | 968 | 137 | 212 | 51 | 109 |
| combined_loan_to_value_ratio | 10532 | 2344 | 14055 | 62023 | 23804 | 51903 | 158 | 217 | 8469 | 368 | 1294 | 972 | 137 | 212 | 51 | 109 |
| property_value | 3950 | 997 | 8158 | 62017 | 23801 | 975 | 236 | 645 | 8504 | 370 | 1313 | 978 | 140 | 212 | 51 | 109 |
| income | 17077 | 1485 | 3652 | 4647 | 1717 | 44173 | 30 | 35 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| applicant_age_above_62 | 7732 | 316 | 1243 | 1942 | 168 | 48399 | 14 | 23 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| loan_term | 2493 | 104 | 4037 | 1157 | 246 | 329 | 15 | 16 | 8452 | 368 | 1290 | 972 | 137 | 212 | 51 | 110 |
| conforming_loan_limit | 1340 | 59 | 169 | 219 | 28 | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| applicant_ethnicity_1 | 61 | 6 | 66 | 29 | 25 | 6 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
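
The per-column loop used to build this summary can also be written with boolean masks, which pandas can group and sum directly. The sketch below runs on a tiny made-up frame (the column names mirror the notebook, the values do not) and is an alternative formulation, not the code that produced the table above.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame; values are invented for illustration only.
toy = pd.DataFrame({
    "action_taken": [1, 1, 3, 3, 4],
    "rate_spread": ["0.5", "Exempt", np.nan, "1.2", np.nan],
    "income": [80.0, np.nan, 55.0, 60.0, np.nan],
})

features = toy.drop(columns="action_taken")

# Boolean masks can be grouped and summed, so no per-column Python loop is needed.
na_by_action = features.isna().groupby(toy["action_taken"]).sum()
exempt_by_action = features.eq("Exempt").groupby(toy["action_taken"]).sum()

# Rows are action_taken codes here; transpose with .T to get one row per variable.
summary = na_by_action.add_suffix(" NA").join(exempt_by_action.add_suffix(" Exempt"))
print(summary)
```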


In [6]:
# conforming_loan_limit
df_preprocessed.conforming_loan_limit.unique() #array(['C', 'NC', nan, 'U'], dtype=object)
df_preprocessed['conforming_loan_limit'] = df_preprocessed['conforming_loan_limit'].fillna('NA')

# ---- property-related cols

# make new indicator cols
df_preprocessed['has_property'] = df_preprocessed['property_value'].notna().astype(int) # since some cols rely on whether the individual has a property

df_preprocessed['exempt_has_property'] = (
    (df_preprocessed['combined_loan_to_value_ratio'] == "Exempt") |
    (df_preprocessed['property_value'] == "Exempt")
).astype(int)

# preprocess existing cols
df_preprocessed['combined_loan_to_value_ratio'] = df_preprocessed['combined_loan_to_value_ratio'].replace("Exempt", 0)
df_preprocessed['combined_loan_to_value_ratio'] = df_preprocessed['combined_loan_to_value_ratio'].fillna(0).astype(float)

df_preprocessed['property_value'] = df_preprocessed['property_value'].replace("Exempt", 0)
df_preprocessed['property_value'] = df_preprocessed['property_value'].fillna(0).astype(float)

# ----

df_preprocessed['exempt_rate_spread'] = ((df_preprocessed['rate_spread'] == "Exempt")).astype(float)
df_preprocessed['rate_spread'] = pd.to_numeric(df_preprocessed['rate_spread'], errors='coerce')
rate_spread_median = df_preprocessed['rate_spread'].median()
df_preprocessed['rate_spread'] = df_preprocessed['rate_spread'].fillna(rate_spread_median).astype(float)

# Flag "Exempt" before coercing to numeric. This block must run only once: after
# pd.to_numeric the "Exempt" strings are gone, so a second pass would reset the flag to 0.
df_preprocessed['exempt_loan_term'] = (df_preprocessed['loan_term'] == "Exempt").astype(float)
df_preprocessed['loan_term'] = pd.to_numeric(df_preprocessed['loan_term'], errors='coerce')
loan_term_median = df_preprocessed['loan_term'].median()
df_preprocessed['loan_term'] = df_preprocessed['loan_term'].fillna(loan_term_median).astype(float)

df_preprocessed['exempt_debt_to_income_ratio'] = ((df_preprocessed['debt_to_income_ratio'] == "Exempt")).astype(float)
df_preprocessed['debt_to_income_ratio'] = pd.to_numeric(df_preprocessed['debt_to_income_ratio'], errors='coerce')
debt_to_income_ratio_median = df_preprocessed['debt_to_income_ratio'].median()
df_preprocessed['debt_to_income_ratio'] = df_preprocessed['debt_to_income_ratio'].fillna(debt_to_income_ratio_median).astype(float)

df_preprocessed['income'] = pd.to_numeric(df_preprocessed['income'], errors='coerce')
income_median = df_preprocessed['income'].median()
df_preprocessed['income'] = df_preprocessed['income'].fillna(income_median).astype(float)

df_preprocessed['co_applicant_age_above_62'] = df_preprocessed['co_applicant_age_above_62'].map({'Yes': 1, 'No': 0, np.nan: -1})

df_preprocessed['applicant_age_above_62'] = df_preprocessed['applicant_age_above_62'].map({'Yes': 1, 'No': 0, np.nan: -1})

df_preprocessed['applicant_ethnicity_1'] = df_preprocessed['applicant_ethnicity_1'].fillna(4).astype(float) # 4 is value for "not applicable"

df_preprocessed['applicant_race_1'] = df_preprocessed['applicant_race_1'].fillna(6).astype(float) # 6 is value for "not provided"

# Set both co-applicant demographic codes to 6 if any co_applicant_* column has a value
# in that row, otherwise 8. Note that this assigns every row, so it overwrites existing
# codes rather than filling only the missing ones.
coapp_cols = [col for col in df_preprocessed.columns if col.startswith('co_applicant_')]
has_coapp_value = df_preprocessed[coapp_cols].notna().any(axis=1)
for col in ['co_applicant_ethnicity_1', 'co_applicant_race_1']:
    df_preprocessed[col] = np.where(has_coapp_value, 6, 8).astype(float)
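
The flag, coerce, and median-fill pattern is repeated above for rate_spread, loan_term, and debt_to_income_ratio. It could be factored into one helper, which also removes the risk of running a block twice. This is a sketch under the assumption that the same three steps apply to each column; `flag_exempt_and_impute` is a hypothetical name, not an existing project function.

```python
import numpy as np
import pandas as pd


def flag_exempt_and_impute(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Add an exempt_<col> indicator, coerce <col> to numeric, and median-fill the gaps."""
    out = df.copy()
    # The flag has to be computed before coercion: afterwards "Exempt" has become NaN.
    out[f"exempt_{col}"] = (out[col] == "Exempt").astype(float)
    out[col] = pd.to_numeric(out[col], errors="coerce")
    out[col] = out[col].fillna(out[col].median()).astype(float)
    return out


# Made-up values, for illustration only.
toy = pd.DataFrame({"loan_term": ["360", "Exempt", np.nan, "180"]})
toy = flag_exempt_and_impute(toy, "loan_term")
print(toy)
```

Because the helper returns a copy, it can be applied column by column in a loop without mutating the input frame.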

#### Process Categorical Variables

In [7]:
print("=" * 60)
print("Preview: df_preprocessed (first 20 rows, all columns)\n")
with pd.option_context('display.max_columns', None):
    display(df_preprocessed.head(20))
print("=" * 60)

print("\nDataFrame info:\n")
print("-" * 20)
df_preprocessed.info()
print("-" * 20)

print("\nMissing values per column:\n")
with pd.option_context('display.max_rows', None):
    missing = df_preprocessed.isnull().sum()
    display(missing[missing > 0].sort_values(ascending=False))
    if (missing == 0).all():
        print("No missing values detected.")


Preview: df_preprocessed (first 20 rows, all columns)



Unnamed: 0,conforming_loan_limit,derived_loan_product_type,derived_dwelling_category,derived_ethnicity,derived_race,derived_sex,action_taken,purchaser_type,preapproval,loan_type,loan_purpose,lien_status,reverse_mortgage,open_end_line_of_credit,business_or_commercial_purpose,loan_amount,combined_loan_to_value_ratio,rate_spread,hoepa_status,loan_term,negative_amortization,interest_only_payment,balloon_payment,other_nonamortizing_features,property_value,construction_method,occupancy_type,manufactured_home_secured_property_type,manufactured_home_land_property_interest,total_units,income,debt_to_income_ratio,applicant_credit_score_type,co_applicant_credit_score_type,applicant_ethnicity_1,co_applicant_ethnicity_1,applicant_ethnicity_observed,co_applicant_ethnicity_observed,applicant_race_1,co_applicant_race_1,applicant_race_observed,co_applicant_race_observed,applicant_sex,co_applicant_sex,applicant_sex_observed,co_applicant_sex_observed,applicant_age,co_applicant_age,applicant_age_above_62,co_applicant_age_above_62,submission_of_application,initially_payable_to_institution,aus_1,has_property,exempt_has_property,exempt_rate_spread,exempt_loan_term,exempt_debt_to_income_ratio
0,C,Conventional:First Lien,Single Family (1-4 Units):Site-Built,Hispanic or Latino,White,Female,1,0,2,1,31,1,2,2,2,125000,90.0,-4.925,2,360.0,2,2,2,2,135000.0,1,1,3,5,1,104.0,37.0,2,10,1.0,6.0,2,4,5.0,6.0,2,4,2,5,2,4,25-34,9999,0,-1,1,1,6,1,0,0.0,0.0,0.0
1,C,Conventional:First Lien,Single Family (1-4 Units):Site-Built,Not Hispanic or Latino,White,Male,1,1,2,1,32,1,2,2,2,205000,80.0,0.947,2,360.0,2,2,2,2,265000.0,1,1,3,5,1,76.0,43.0,3,10,2.0,6.0,2,4,5.0,6.0,2,4,1,5,2,4,25-34,9999,0,-1,1,1,2,1,0,0.0,0.0,0.0
2,C,Conventional:Subordinate Lien,Single Family (1-4 Units):Site-Built,Not Hispanic or Latino,White,Female,3,0,2,1,2,2,2,2,2,25000,0.0,0.341,3,360.0,2,2,2,2,0.0,1,1,3,5,1,47.0,43.0,9,10,2.0,6.0,1,4,5.0,6.0,1,4,2,5,1,4,65-74,9999,1,-1,1,1,6,0,0,0.0,0.0,0.0
3,C,FHA:First Lien,Single Family (1-4 Units):Site-Built,Not Hispanic or Latino,White,Female,1,71,2,2,1,1,2,2,2,215000,96.5,1.31,2,360.0,2,2,2,2,215000.0,1,1,3,5,1,67.0,46.0,1,10,2.0,6.0,2,4,5.0,6.0,2,4,2,5,2,4,25-34,9999,0,-1,1,1,3,1,0,0.0,0.0,0.0
4,C,Conventional:First Lien,Single Family (1-4 Units):Site-Built,Not Hispanic or Latino,White,Male,4,0,2,1,1,1,2,2,2,205000,0.0,0.341,3,360.0,2,2,2,2,0.0,1,1,3,5,1,60.0,43.0,9,9,2.0,6.0,2,4,5.0,6.0,2,4,1,5,2,4,25-34,9999,0,-1,1,1,1,0,0,0.0,0.0,0.0
5,C,Conventional:First Lien,Single Family (1-4 Units):Site-Built,Hispanic or Latino,White,Joint,3,0,2,1,1,1,2,2,2,295000,97.0,0.341,3,360.0,2,2,2,2,305000.0,1,1,3,5,1,86.0,36.0,1,9,1.0,6.0,2,2,5.0,6.0,2,2,1,2,2,2,25-34,25-34,0,0,1,1,1,1,0,0.0,0.0,0.0
6,C,FHA:First Lien,Single Family (1-4 Units):Site-Built,Hispanic or Latino,White,Male,1,71,2,2,1,1,2,2,2,215000,90.0,1.339,2,360.0,2,2,2,2,235000.0,1,1,3,5,1,57.0,49.0,1,10,1.0,6.0,2,4,5.0,6.0,2,4,1,5,2,4,25-34,9999,0,-1,1,1,1,1,0,0.0,0.0,0.0
7,C,Conventional:Subordinate Lien,Single Family (1-4 Units):Site-Built,Not Hispanic or Latino,White,Male,1,0,2,1,2,2,2,1,2,105000,77.682,0.35,2,300.0,2,1,2,2,465000.0,1,1,3,5,1,93.0,39.0,11,10,2.0,6.0,2,4,5.0,6.0,2,4,1,5,2,4,55-64,9999,0,-1,1,1,6,1,0,0.0,0.0,0.0
8,C,Conventional:First Lien,Single Family (1-4 Units):Site-Built,Not Hispanic or Latino,White,Male,1,71,2,1,1,1,2,2,2,325000,95.0,0.623,2,360.0,2,2,2,2,345000.0,1,1,3,5,1,182.0,43.0,3,10,2.0,6.0,2,4,5.0,6.0,2,4,1,5,2,4,55-64,9999,0,-1,1,1,1,1,0,0.0,0.0,0.0
9,C,Conventional:First Lien,Single Family (1-4 Units):Site-Built,Not Hispanic or Latino,White,Joint,1,71,2,1,1,1,2,2,2,285000,90.0,0.62,2,360.0,2,2,2,2,325000.0,1,1,3,5,1,100.0,45.0,3,9,2.0,6.0,2,2,5.0,6.0,2,2,2,1,2,2,<25,<25,0,0,1,1,1,1,0,0.0,0.0,0.0



DataFrame info:

--------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 499889 entries, 0 to 499888
Data columns (total 58 columns):
 #   Column                                    Non-Null Count   Dtype  
---  ------                                    --------------   -----  
 0   conforming_loan_limit                     499889 non-null  object 
 1   derived_loan_product_type                 499889 non-null  object 
 2   derived_dwelling_category                 499889 non-null  object 
 3   derived_ethnicity                         499889 non-null  object 
 4   derived_race                              499889 non-null  object 
 5   derived_sex                               499889 non-null  object 
 6   action_taken                              499889 non-null  int64  
 7   purchaser_type                            499889 non-null  int64  
 8   preapproval                               499889 non-null  int64  
 9   loan_type                                 499889 non-

Series([], dtype: int64)

No missing values detected.


In [8]:
# First, determine the categorical (object/category) columns and process them:

categorical_cols = df_preprocessed.select_dtypes(include=['object', 'category']).columns.tolist()

print("--- Before")
for col in df_preprocessed.columns:
    print(f"{col}: {df_preprocessed[col].dtype}")

# One-hot encode 
df_preprocessed_encoded = pd.get_dummies(df_preprocessed, columns=categorical_cols, drop_first=True)

# Explicitly cast all columns to numeric types (float32 as fallback), to prevent np.object_ downstream
for col in df_preprocessed_encoded.columns:
    try:
        df_preprocessed_encoded[col] = pd.to_numeric(df_preprocessed_encoded[col], errors='raise')
    except Exception:
        # If conversion fails, fallback to float32 forcibly
        df_preprocessed_encoded[col] = df_preprocessed_encoded[col].astype(np.float32)

# Final safety: make sure there are no object dtypes left
for col in df_preprocessed_encoded.columns:
    if df_preprocessed_encoded[col].dtype == 'O':
        # Any column that is still object-typed is converted to integer category codes and cast to float32
        df_preprocessed_encoded[col] = (
            df_preprocessed_encoded[col].astype('category').cat.codes.astype(np.float32)
        )

print("\n--- After")
for col in df_preprocessed_encoded.columns:
    print(f"{col}: {df_preprocessed_encoded[col].dtype}")

print(df_preprocessed_encoded.select_dtypes(include=['object']).columns)


--- Before
conforming_loan_limit: object
derived_loan_product_type: object
derived_dwelling_category: object
derived_ethnicity: object
derived_race: object
derived_sex: object
action_taken: int64
purchaser_type: int64
preapproval: int64
loan_type: int64
loan_purpose: int64
lien_status: int64
reverse_mortgage: int64
open_end_line_of_credit: int64
business_or_commercial_purpose: int64
loan_amount: int64
combined_loan_to_value_ratio: float64
rate_spread: float64
hoepa_status: int64
loan_term: float64
negative_amortization: int64
interest_only_payment: int64
balloon_payment: int64
other_nonamortizing_features: int64
property_value: float64
construction_method: int64
occupancy_type: int64
manufactured_home_secured_property_type: int64
manufactured_home_land_property_interest: int64
total_units: object
income: float64
debt_to_income_ratio: float64
applicant_credit_score_type: int64
co_applicant_credit_score_type: int64
applicant_ethnicity_1: float64
co_applicant_ethnicity_1: float64
applican
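
The conversion loops in the cell above exist to guarantee numeric dtypes after encoding. `pd.get_dummies` also accepts a `dtype` argument, so the indicator columns can be created as floats directly. The sketch below uses a small made-up frame and shows one possible simplification, not the notebook's actual encoding step.

```python
import numpy as np
import pandas as pd

# Made-up rows; the column names mirror the notebook.
toy = pd.DataFrame({
    "derived_sex": ["Female", "Male", "Joint", "Male"],
    "loan_amount": [125000, 205000, 295000, 105000],
})

# dtype=np.float32 makes the indicator columns numeric from the start,
# so no bool or object columns need converting afterwards.
encoded = pd.get_dummies(toy, columns=["derived_sex"], drop_first=True, dtype=np.float32)

print(encoded.dtypes)
```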

#### Split 

In [9]:
# Split the preprocessed dataframe into train and test sets
target_col = "action_taken"
X = df_preprocessed_encoded.drop(columns=[target_col])
y = df_preprocessed_encoded[target_col]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("\nChecking dtypes before training...")
print(X_train.dtypes.value_counts())

if not X_train.select_dtypes(include=['object']).empty:
    print("⚠️ Object columns found:")
    print(X_train.select_dtypes(include=['object']).columns.tolist())
else:
    print("✅ All numeric types confirmed.")

X_train = X_train.apply(pd.to_numeric, errors='coerce')
X_test = X_test.apply(pd.to_numeric, errors='coerce')
X_train = X_train.fillna(0).astype(float)
X_test = X_test.fillna(0).astype(float)



Checking dtypes before training...
bool       50
int64      35
float64    13
Name: count, dtype: int64
✅ All numeric types confirmed.
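
The split above produces two partitions, and the training cell passes `X_test` in what appears to be the validation slot of `ff_model.train`. If a separate held-out set is wanted, a three-way stratified split can be built by calling `train_test_split` twice. The sketch below uses random stand-in data; the 60/20/20 proportions are an assumption, not a choice made in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))      # made-up features
y = rng.integers(0, 3, size=1000)   # made-up labels with 3 classes

# First carve off the final test set, then split the remainder into train and validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42, stratify=y_trainval
)

print(len(X_train), len(X_val), len(X_test))
```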


## Training

In [10]:
# Train the model
print("Training NN...")
ff_model = FeedForwardModel(
    hidden_dims=[256, 128, 64],  # Slightly larger network
    dropout=0.3,
    learning_rate=0.0001
)

# Note: The model expects multi-class classification (8 classes for action_taken)
# We need to adjust target labels to be 0-indexed for CrossEntropyLoss
# The model will detect num_classes automatically
y_train_adj = y_train - 1  # Convert from 1-8 to 0-7
y_test_adj = y_test - 1

history = ff_model.train(
    X_train, y_train_adj, 
    X_test, y_test_adj, 
    epochs=50, 
    batch_size=256
)

# Evaluate - predict returns class indices (0-7), convert back to 1-8
y_pred = ff_model.predict(X_test) + 1  # Convert from 0-7 back to 1-8

acc = accuracy_score(y_test, y_pred)
print(f"\nFFNN Accuracy: {acc:.4f}")

# Show detailed evaluation metrics
from sklearn.metrics import classification_report, confusion_matrix
print(f"\nClassification Report:\n{classification_report(y_test, y_pred)}")
print(f"\nConfusion Matrix:\n{confusion_matrix(y_test, y_pred)}")

# Show per-class accuracy (restricted to one true class, this equals that class's recall)
print("\nPer-class accuracy:")
for class_label in sorted(y_test.unique()):
    mask = y_test == class_label
    if mask.sum() > 0:
        class_acc = accuracy_score(y_test[mask], y_pred[mask])
        print(f"  Class {class_label}: {class_acc:.4f} ({mask.sum()} samples)")

Training NN...


Training: 100%|██████████| 50/50 [02:50<00:00,  3.41s/epoch, train_loss=1.2619, val_loss=1.9460, val_acc=0.4370, val_f1=0.3111]



FFNN Accuracy: 0.4370

Classification Report:
              precision    recall  f1-score   support

           1       0.71      0.45      0.55     50476
           2       0.05      0.28      0.08      2921
           3       0.30      0.29      0.29     17115
           4       0.71      0.60      0.65     12599
           5       0.24      0.03      0.05      4788
           6       0.89      0.65      0.75     10423
           7       0.01      0.08      0.01       393
           8       0.05      0.44      0.10      1263

    accuracy                           0.44     99978
   macro avg       0.37      0.35      0.31     99978
weighted avg       0.60      0.44      0.50     99978


Confusion Matrix:
[[22812 10037  9056   555    17   497   128  7374]
 [  795   805   773   120     6    20    37   365]
 [ 4928  4105  4961   428    34   142   762  1755]
 [  183  1141  1089  7564   403   131  2078    10]
 [   23   593   658  1929   145    71  1368     1]
 [ 3101   254   100    84   
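
To put the reported accuracy in context, it helps to compare it with a majority-class baseline, that is, always predicting the most frequent class. The sketch below uses only the per-class support and the accuracy printed in the classification report above.

```python
# Per-class support copied from the classification report above.
support = {1: 50476, 2: 2921, 3: 17115, 4: 12599,
           5: 4788, 6: 10423, 7: 393, 8: 1263}
model_accuracy = 0.4370  # reported FFNN accuracy

total = sum(support.values())
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Model minus baseline: {model_accuracy - baseline_accuracy:+.4f}")
```

On these numbers the baseline is higher than the model's accuracy, so the macro-averaged metrics in the report are the more informative measure of what the network has learned about the minority classes.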