<a href="https://colab.research.google.com/github/samuelhtampubolon/SDPM2025/blob/main/Pinjaman_Bank_B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [6]:
import pandas as pd
import numpy as np

# Step 1: Read the file
file_path = '/content/sample_data/Pinjaman_Bank_B.xlsx'
df = pd.read_excel(file_path)

In [7]:
# Step 2: Convert numeric columns, handling malformed values (commas, 'Invalid')
numeric_cols = ['Age', 'Annual_Income', 'Loan_Amount', 'Loan_Term_Months',
                'Interest_Rate', 'Credit_Score', 'Employment_Years']

for col in numeric_cols:
    values = df[col].astype(str)
    if col == 'Annual_Income':
        # Strip thousands separators (literal commas) before parsing
        values = values.str.replace(',', '', regex=False)
    # errors='coerce' turns anything non-numeric, including 'Invalid', into NaN
    df[col] = pd.to_numeric(values, errors='coerce')
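
The effect of this conversion can be seen on a small made-up Series. This is only a sketch: the sample values and the names `raw_income` and `cleaned_income` are hypothetical and are not taken from Pinjaman_Bank_B.xlsx.

```python
import pandas as pd

# Hypothetical raw values: a thousands separator, a text placeholder, a clean number
raw_income = pd.Series(['87,172', 'Invalid', '46736'])

# Remove literal commas, then coerce anything that still is not a number to NaN
cleaned_income = pd.to_numeric(
    raw_income.astype(str).str.replace(',', '', regex=False),
    errors='coerce',
)

# Only the 'Invalid' entry should end up missing
print(cleaned_income)
print("Missing after conversion:", int(cleaned_income.isna().sum()))
```

Passing `regex=False` treats the comma as a plain character, so the behavior does not depend on the pandas version's default for `str.replace`.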

In [8]:
# Step 3: Handle negative values (set to NaN, since they are not meaningful here)
for col in ['Age', 'Annual_Income', 'Loan_Amount', 'Loan_Term_Months', 'Employment_Years']:
    # mask() replaces every entry where the condition is True with NaN
    df[col] = df[col].mask(df[col] < 0)
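
As a quick illustration of this step, the sketch below applies the same rule to a made-up column (the values and the names `ages` and `ages_masked` are assumptions for the example). It uses `Series.mask`, which sets entries to NaN wherever the condition holds; entries that are already NaN compare as `False` and simply stay NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one negative entry and one already-missing entry
ages = pd.Series([56.0, -3.0, np.nan, 32.0])

# Negative values become NaN; valid values and existing NaN are left alone
ages_masked = ages.mask(ages < 0)

print(ages_masked)
```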

In [9]:
# Step 4: Handle missing values (imputation)
# Numeric columns: fill with the median.
# Assign the result back rather than calling df[col].fillna(..., inplace=True):
# pandas warns that the chained inplace form stops working in pandas 3.0.
for col in numeric_cols:
    df[col] = df[col].fillna(df[col].median())

# Categorical columns: fill with the mode ('Unknown' if the column has no mode)
categorical_cols = ['Gender', 'Loan_Status']
for col in categorical_cols:
    modes = df[col].mode()
    mode_val = modes[0] if not modes.empty else 'Unknown'
    df[col] = df[col].fillna(mode_val)
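
The pandas warning also mentions a dictionary form of `fillna`. A minimal sketch of that variant on a made-up frame follows; the frame `toy` and its values are hypothetical, and the column names are borrowed from this notebook purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing numeric and one missing categorical value
toy = pd.DataFrame({
    'Credit_Score': [512.0, np.nan, 831.0, 528.0],
    'Gender': ['Male', None, 'Other', 'Other'],
})

# One fill value per column: median for the numeric, mode for the categorical
fill_values = {
    'Credit_Score': toy['Credit_Score'].median(),
    'Gender': toy['Gender'].mode()[0],
}

# fillna with a dict operates on the frame itself, so no chained assignment occurs
toy = toy.fillna(fill_values)
print(toy)
```

Because the call is made on the DataFrame rather than on `df[col]`, there is no intermediate copy for pandas to warn about.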


In [10]:
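# Step 5: Fix invalid categories (e.g. replace 'Unknown' in Gender with the mode)
# Take the mode over the valid labels only, so 'Unknown' cannot replace itself
valid_gender = df.loc[df['Gender'] != 'Unknown', 'Gender']
df['Gender'] = df['Gender'].replace('Unknown', valid_gender.mode()[0])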

In [11]:
# Step 6: Handle outliers (clip to a plausible range)
df['Age'] = df['Age'].clip(18, 100)
df['Annual_Income'] = df['Annual_Income'].clip(0, 1000000)  # Assumed maximum of 1 million
df['Loan_Amount'] = df['Loan_Amount'].clip(0, 1000000)
df['Loan_Term_Months'] = df['Loan_Term_Months'].clip(12, 360)
df['Interest_Rate'] = df['Interest_Rate'].clip(0, 20)
df['Credit_Score'] = df['Credit_Score'].clip(300, 850)
df['Employment_Years'] = df['Employment_Years'].clip(0, 50)
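
Clipping silently overwrites out-of-range values, so it can help to count how many entries it changed. The sketch below does this for a made-up credit score column; the values and the names `credit`, `clipped`, and `n_changed` are assumptions for the example.

```python
import pandas as pd

# Hypothetical credit scores, two of them outside the 300-850 range
credit = pd.Series([250, 512, 900])

# Values below 300 become 300 and values above 850 become 850
clipped = credit.clip(300, 850)

# Count the entries that clipping actually modified
n_changed = int((credit != clipped).sum())
print(clipped.tolist(), "changed:", n_changed)
```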

In [12]:
# Step 7: Drop duplicate rows
df.drop_duplicates(inplace=True)
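
Before dropping duplicates it may be worth checking how many rows are affected. A small sketch on a made-up frame (the frame `loans` and its rows are hypothetical):

```python
import pandas as pd

# Hypothetical frame where the third row repeats the first exactly
loans = pd.DataFrame({
    'Loan_ID': ['L0001', 'L0002', 'L0001'],
    'Loan_Amount': [444430.0, 497738.0, 444430.0],
})

# duplicated() flags every repeat of an earlier, fully identical row
n_dupes = int(loans.duplicated().sum())

# Keep the first occurrence of each row and renumber the index
deduped = loans.drop_duplicates().reset_index(drop=True)
print("Duplicates found:", n_dupes, "| rows kept:", len(deduped))
```

Note that `drop_duplicates()` only removes rows that match in every column; rows that share a `Loan_ID` but differ elsewhere are kept.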

In [13]:
# Step 8: Final type conversion (make sure numeric columns are float/int)
for col in numeric_cols:
    if col in ['Age', 'Loan_Term_Months', 'Credit_Score', 'Employment_Years']:
        df[col] = df[col].astype(int)
    else:
        df[col] = df[col].astype(float)
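
One detail of the integer conversion: `astype(int)` truncates the fractional part rather than rounding, which matters if an imputed median is not a whole number. The sketch below contrasts the two on made-up values; rounding first is shown only as an alternative, not as what this notebook does.

```python
import pandas as pd

# Hypothetical employment years, one with a fractional part
years = pd.Series([3.0, 8.7, 16.0])

# astype(int) drops the fraction: 8.7 becomes 8
truncated = years.astype(int)

# Rounding before the cast gives the nearest whole number instead: 8.7 becomes 9
rounded = years.round().astype(int)

print(truncated.tolist(), rounded.tolist())
```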

In [14]:
# Step 9: Save to a new CSV file
df.to_csv('loan_data_cleaned.csv', index=False)

# Show a preview of the cleaned data
print(df.head())
print("\nShape after cleaning:", df.shape)

  Loan_ID Borrower_Name  Age Gender  Annual_Income  Loan_Amount  \
0   L0001    Borrower 1   56  Other        87172.0     444430.0   
1   L0002    Borrower 2   69  Other       113264.0     497738.0   
2   L0003    Borrower 3   46  Other        46736.0      91416.0   
3   L0004    Borrower 4   32   Male       132859.0     337415.0   
4   L0005    Borrower 5   60  Other       132181.0     311208.0   

   Loan_Term_Months  Interest_Rate  Credit_Score  Employment_Years Loan_Status  
0                48          12.85           512                28    Approved  
1                12          10.82           502                 3    Rejected  
2                96           5.48           528                 9    Rejected  
3                36           6.29           552                16    Rejected  
4                84           5.58           831                 9    Approved  

Shape after cleaning: (109, 11)
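
A short sanity check after cleaning can confirm that no missing values or duplicate rows remain. The helper below is a hypothetical sketch (the name `summarize_quality` and the sample frame are assumptions, not part of the original notebook); in the notebook it could be called on `df` directly.

```python
import pandas as pd

def summarize_quality(frame):
    """Return counts of missing cells and fully duplicated rows in a DataFrame."""
    return {
        'missing': int(frame.isna().sum().sum()),
        'duplicates': int(frame.duplicated().sum()),
    }

# Hypothetical cleaned sample: no gaps and no repeated rows
sample = pd.DataFrame({
    'Loan_ID': ['L0001', 'L0002'],
    'Age': [56, 69],
    'Loan_Status': ['Approved', 'Rejected'],
})

report = summarize_quality(sample)
print(report)
```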
