### Business Understanding

**Identifiers & Metadata**

application_id: A unique identifier for each individual loan application.

customer_id: A unique identifier for each customer. A single customer may have multiple applications.

application_date: The date on which the loan application was submitted.

data_batch_id: An identifier for the data processing batch this record belongs to.

**Loan Characteristics**

loan_amount_requested: The principal amount of the loan requested by the applicant.

loan_amount_usd: The requested loan amount converted to US Dollars for standardization.

loan_tenure_months: The duration of the loan repayment period in months.

interest_rate_offered: The annual interest rate offered for the loan.

purpose_of_loan: The stated reason for seeking the loan.

loan_type_*: A set of binary columns indicating the specific type of loan product.

**Applicant Financial Profile**

employment_status: The applicant's current employment situation.

monthly_income: The applicant's stated gross monthly income.

yearly_income: The applicant's stated gross annual income.

annual_bonus: The applicant's declared annual bonus amount.

cibil_score: A credit score (e.g., from CIBIL) representing the applicant's creditworthiness and history. Higher scores indicate better credit health.

existing_emis_monthly: The total amount of Equated Monthly Installments (EMIs) the applicant is currently paying for other existing loans.

debt_to_income_ratio: The ratio of the applicant's debt obligations to their income, used to assess their ability to manage monthly payments.

credit_utilization_ratio: The ratio of the applicant's outstanding credit card debt to their total credit card limit.
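The two ratios above can be illustrated with a small worked example. The dataset does not document how they were computed, so the formulas below are assumptions: the debt-to-income ratio is taken as monthly debt payments (such as existing_emis_monthly) divided by monthly_income, and the helper names and figures are hypothetical.

```python
def debt_to_income(monthly_debt: float, monthly_income: float) -> float:
    """Assumed definition: monthly debt payments / gross monthly income."""
    if monthly_income <= 0:
        raise ValueError("monthly_income must be positive")
    return monthly_debt / monthly_income


def credit_utilization(outstanding_balance: float, total_limit: float) -> float:
    """Outstanding credit card debt / total credit card limit."""
    if total_limit <= 0:
        raise ValueError("total_limit must be positive")
    return outstanding_balance / total_limit


# Hypothetical applicant: 15,000 in monthly EMIs on a 60,000 monthly income,
# and 20,000 outstanding on cards with a combined limit of 80,000.
dti = debt_to_income(monthly_debt=15_000, monthly_income=60_000)
utilization = credit_utilization(outstanding_balance=20_000, total_limit=80_000)
print(dti, utilization)
```

Under these assumptions, a quarter of the applicant's income goes to existing debt and a quarter of the available card limit is in use.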

**Applicant Demographics & Personal Information**

applicant_age: The age of the applicant in years at the time of application.

gender_*: A set of one-hot encoded binary columns representing the applicant's gender.

property_ownership_status: The applicant's housing situation.

residential_address: The applicant's provided residential address (likely anonymized or generalized).

number_of_dependents: The number of people financially dependent on the applicant.

**Target Variable**

fraud_flag: This is the key target variable for prediction. It's a binary indicator where 1 signifies a fraudulent application and 0 signifies a legitimate application.

In [73]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.neighbors import KNeighborsClassifier

#### Data Treatment

In [88]:
df = pd.read_csv('train.csv')
df = df.drop(columns=['Unnamed: 0', 'data_batch_id'])  # drop the unnamed first column (it looks like just a random id) and the batch id
# keep only distinct rows, removing duplicated ones
df=df.drop_duplicates()

# drop dates
df = df.drop(columns='application_date')

categCols = ['purpose_of_loan', 'employment_status', 'property_ownership_status']
df[categCols] = df[categCols].astype('category')

# removing unreliable dummy columns
dummyDrop = [col for col in df.columns if 'loan_type' in col]
df = df.drop(columns=dummyDrop)

# list the columns that now have the category dtype
# (df.dtypes is a Series and has no .category attribute)
df.select_dtypes(include='category').dtypes
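One possible way to check which columns ended up with the category dtype is `select_dtypes`. The sketch below uses a toy frame; the column values are made up for illustration and are not taken from train.csv.

```python
import pandas as pd

# Toy data with hypothetical values
toy = pd.DataFrame({
    "purpose_of_loan": ["Education", "Home", "Education"],
    "loan_amount_requested": [100_000, 250_000, 80_000],
})

# Convert the text column to the category dtype, as in the cell above
toy["purpose_of_loan"] = toy["purpose_of_loan"].astype("category")

# Select only the categorical columns and list their names
categ_found = toy.select_dtypes(include="category").columns.tolist()
print(categ_found)
```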

#### Outliers Removal

In [75]:
# removing outliers in loan_tenure_months using the upper IQR fence
q1 = df['loan_tenure_months'].quantile(0.25)
q3 = df['loan_tenure_months'].quantile(0.75)
iqr = q3-q1
upperBound = q3 + iqr*1.5

# removing outliers
df = df[df['loan_tenure_months']<=upperBound]
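The cell above only applies the upper fence. As a sketch of the same IQR rule with both fences, the hypothetical helper `iqr_bounds` below returns the lower and upper limits; the toy values are invented for illustration.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple:
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy tenures in months; 360 is an obvious outlier
toy = pd.DataFrame({"loan_tenure_months": [12, 24, 24, 36, 36, 48, 60, 360]})

lower, upper = iqr_bounds(toy["loan_tenure_months"])

# Keep rows inside both fences (between is inclusive on both ends)
filtered = toy[toy["loan_tenure_months"].between(lower, upper)]
print(lower, upper, len(filtered))
```

For a column that cannot be negative, such as a loan tenure, the lower fence often falls below zero and removes nothing, which may be why only the upper bound is used above.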

#### Imputation

##### Gender

In [76]:
def gender_code(row):
    if row['gender_Male'] == 1:
        return 1
    elif row['gender_Other'] == 1:
        return 0
    else:
        return None  # rows flagged as neither male nor other are treated as unknown

df['gender_code'] = df.apply(gender_code, axis=1)
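Row-wise `apply` can be slow on large frames. A vectorized alternative, sketched here on a toy frame with the same two dummy columns, is `numpy.select`; it is meant as an equivalent of the function above, not as the notebook's actual method.

```python
import numpy as np
import pandas as pd

# Toy dummies: one male, one other, one row with neither flag set
toy = pd.DataFrame({
    "gender_Male": [1, 0, 0],
    "gender_Other": [0, 1, 0],
})

# Conditions are checked in order; rows matching none get NaN (unknown)
toy["gender_code"] = np.select(
    [toy["gender_Male"] == 1, toy["gender_Other"] == 1],
    [1.0, 0.0],
    default=np.nan,
)
print(toy)
```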

In [77]:
## gender imputation through KNN

# defining known and unknown gender
dfKnown = df[df['gender_code'].notna()]
dfUnknown = df[df['gender_code'].isna()]

# defining train columns
colsTrain = ['debt_to_income_ratio', 'applicant_age', 'yearly_income', 'annual_bonus']

## using the known data to train the model

# defining the train dataset
xTrain = dfKnown[colsTrain]
xTest = dfUnknown[colsTrain]
yTrain = dfKnown['gender_code']

# standardizing the known and unknown selected attributes
scaler = StandardScaler()
xTrainScaled = scaler.fit_transform(xTrain)
xTestScaled = scaler.transform(xTest)

# training the model
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(xTrainScaled, yTrain)

# replacing the missing values by the KNN prediction
df.loc[df['gender_code'].isna(), 'gender_code'] = knn.predict(xTestScaled)
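The scale, fit and predict steps above can also be bundled in a scikit-learn pipeline, so the scaler fitted on the known rows is reused automatically at prediction time. This is a minimal sketch on synthetic data; the random values and the reduced feature list are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the real frame
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "applicant_age": rng.integers(21, 65, size=50),
    "yearly_income": rng.normal(600_000, 150_000, size=50),
    "gender_code": rng.integers(0, 2, size=50).astype(float),
})
toy.loc[toy.index[:10], "gender_code"] = np.nan  # simulate unknown gender

features = ["applicant_age", "yearly_income"]
known = toy["gender_code"].notna()

# Scaling and KNN are fitted together on the rows with a known label
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(toy.loc[known, features], toy.loc[known, "gender_code"])

# Fill the unknown rows with the predicted label
toy.loc[~known, "gender_code"] = model.predict(toy.loc[~known, features])
print(toy["gender_code"].value_counts())
```

Note that `predict` raises an error on an empty input, so a guard such as `if (~known).any():` may be worth adding when a batch has no missing labels.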

In [78]:
# dropping the original gender dummies, now replaced by gender_code
genderDrop = ['gender_Male', 'gender_Other']
df = df.drop(columns=genderDrop)

# 0: gender_Other (treated here as woman) ; 1: gender_Male (man)
df['gender_code'].value_counts()

gender_code
0.0    20096
1.0    19541
Name: count, dtype: int64

##### Monthly Income

In [None]:
# imputing missing monthly income from yearly income
df['monthly_income'] = df['monthly_income'].fillna(df['yearly_income']/12)
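A small toy example may help show what this `fillna` does and does not fill: a missing monthly income is replaced by one twelfth of the yearly income, but a row missing both values stays missing. The numbers are invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "monthly_income": [5_000.0, np.nan, np.nan],
    "yearly_income": [60_000.0, 120_000.0, np.nan],
})

# fillna aligns on the index, so each gap uses that row's yearly income
toy["monthly_income"] = toy["monthly_income"].fillna(toy["yearly_income"] / 12)

# Rows where both incomes were missing are still NaN afterwards
still_missing = int(toy["monthly_income"].isna().sum())
print(toy, still_missing)
```

Checking `still_missing` after the imputation shows whether a second strategy (for example a median fill) is still needed.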