# Phase 4: Feature Engineering (preparing the variables)

___

Prepare the data for credit-risk modelling, using a lightweight dataset:
- source file: `data/interim/lendingclub_light.csv`
- create the target `default_flag` (Phase 3 rule)
- convert formats (`term` → `term_months`)
- apply a simple treatment to extreme values (`annual_inc`)
- encode categories (`grade`)
- build `X` (features) and `y` (target)

___

A model only works with numbers:
- `term` is text ("36 months") → needs converting
- `grade` is a category (A, B, C, ...) → needs encoding
- `annual_inc` contains extreme values → needs stabilizing
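
As a quick illustration of the point above, the sketch below lists the columns that are not yet numeric. The `sample` frame is a hypothetical stand-in for three columns of the dataset, with made-up values; it is not the notebook's actual data.

```python
import pandas as pd

# Toy frame mimicking three columns of the dataset (illustrative values only).
sample = pd.DataFrame({
    "term": ["36 months", "60 months"],
    "grade": ["A", "C"],
    "annual_inc": [65000.0, 7600000.0],
})

# Columns that are not numeric still need converting or encoding before modelling.
non_numeric_cols = sample.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```

Running the same check on the full `df` would be one way to confirm that no text column slips into the feature matrix.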

___

- `data/processed/X_features.csv`
- `data/processed/y_target.csv`
- `reports/tableau_exports/04_feature_list.csv`


In [6]:
import pandas as pd
import os
os.path.exists("data/interim/lendingclub_light.csv")


True

In [3]:
df = pd.read_csv("data/interim/lendingclub_light.csv", low_memory=False)


In [4]:
df.shape


(250000, 8)

In [7]:
df.head()

| | loan_amnt | term | int_rate | grade | annual_inc | loan_status | addr_state | dti |
|---|---|---|---|---|---|---|---|---|
| 0 | 20000.0 | 36 months | 13.99 | C | 65000.0 | Fully Paid | CA | 13.68 |
| 1 | 7000.0 | 36 months | 9.16 | B | 35000.0 | Fully Paid | TX | 22.39 |
| 2 | 20000.0 | 36 months | 8.67 | B | 90000.0 | Fully Paid | UT | 29.14 |
| 3 | 16000.0 | 36 months | 14.46 | C | 50000.0 | Fully Paid | IL | 34.64 |
| 4 | 4000.0 | 36 months | 11.53 | B | 85000.0 | Fully Paid | WI | 24.27 |


In [10]:
# Create the target variable (1 = "Charged Off" or "Default", 0 = any other status)
df["default_flag"] = df["loan_status"].isin(["Charged Off", "Default"]).astype(int)

df["default_flag"].value_counts(normalize=True)

default_flag
0    0.800412
1    0.199588
Name: proportion, dtype: float64
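
To see which statuses end up in each class, a cross-tabulation of `loan_status` against the flag can help. This is a minimal sketch on a hypothetical `statuses` series; the status values other than "Charged Off" and "Default" are assumptions for illustration, and the real dataset may contain different ones.

```python
import pandas as pd

# Toy statuses (illustrative; the real column may hold other values).
statuses = pd.Series(
    ["Fully Paid", "Charged Off", "Current", "Default", "Fully Paid"],
    name="loan_status",
)

# Same rule as in the cell above: these two statuses count as a default.
default_statuses = ["Charged Off", "Default"]
default_flag = statuses.isin(default_statuses).astype(int).rename("default_flag")

# Rows are statuses, columns are the flag values 0 and 1.
mapping_check = pd.crosstab(statuses, default_flag)
print(mapping_check)
```

Any status that should not be treated as a non-default (for example, loans that are still running) would show up in the `0` column of this table.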

In [12]:
# Convert `term` (e.g. "36 months") to a number of months
df["term_months"] = df["term"].astype(str).str.extract(r"(\d+)", expand=False).astype(int)
df.term_months.head()

0    36
1    36
2    36
3    36
4    36
Name: term_months, dtype: int64
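
The conversion above assumes every row contains digits; `astype(int)` would raise on a missing `term`. A more tolerant variant, sketched here on a hypothetical `term` series with one missing value, keeps such rows as `<NA>` by using pandas' nullable integer dtype.

```python
import pandas as pd

# Toy column with one missing value (illustrative only).
term = pd.Series([" 36 months", "60 months", None], name="term")

# expand=False returns a Series; errors="coerce" turns unparsable values into NaN.
digits = term.str.extract(r"(\d+)", expand=False)
term_months = pd.to_numeric(digits, errors="coerce").astype("Int64")  # nullable integers
print(term_months)
```

Whether missing terms should then be dropped or imputed is a modelling decision that this sketch does not make.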

In [21]:
# Handle outliers in annual_inc: cap values at the 99th percentile
p99 = df["annual_inc"].quantile(0.99)
df["annual_inc_clean"] = df["annual_inc"].clip(lower=0, upper=p99)
df[["annual_inc", "annual_inc_clean"]].describe()

| | annual_inc | annual_inc_clean |
|---|---|---|
| count | 250000.0 | 250000.0 |
| mean | 76218.51 | 74497.822578 |
| std | 70756.04 | 42124.284787 |
| min | 0.0 | 0.0 |
| 25% | 45503.25 | 45503.25 |
| 50% | 65000.0 | 65000.0 |
| 75% | 90000.0 | 90000.0 |
| max | 7600000.0 | 250000.0 |
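
The effect of the cap can be checked by counting how many rows it changes. The sketch below uses a hypothetical `income` series of 99 ordinary values and one extreme value; the numbers are invented for illustration.

```python
import pandas as pd

# 99 ordinary incomes plus one extreme value (illustrative only).
income = pd.Series([50_000.0] * 99 + [7_600_000.0], name="annual_inc")

# Same approach as the cell above: cap at the 99th percentile, floor at 0.
cap = income.quantile(0.99)
income_clean = income.clip(lower=0, upper=cap)

# Number of rows whose value was pulled down to the cap.
n_capped = int((income > cap).sum())
print(f"cap={cap:.0f}, rows capped={n_capped}")
```

Only the upper tail is modified; values at or below the cap pass through unchanged, which is why the quartiles in the `describe()` output above are identical for both columns.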


In [None]:
# Encode the grade column (category -> 0/1 columns); drop_first drops grade A, which becomes the baseline
df = pd.get_dummies(df, columns=["grade"], drop_first=True)

In [23]:
[c for c in df.columns if c.startswith("grade_")][:10]

['grade_B', 'grade_C', 'grade_D', 'grade_E', 'grade_F', 'grade_G']
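
With `drop_first=True`, a row of grade A is represented by zeros in every `grade_*` column. The sketch below shows this on a hypothetical `grades` frame with three levels; it also passes `dtype=int`, an optional choice not used in the notebook, to get 0/1 integers instead of booleans.

```python
import pandas as pd

# Toy frame with three grade levels (illustrative only).
grades = pd.DataFrame({"grade": ["A", "B", "C", "A"]})

# drop_first=True removes the first level (A); dtype=int yields 0/1 instead of True/False.
encoded = pd.get_dummies(grades, columns=["grade"], drop_first=True, dtype=int)
print(encoded)
```

Dropping one level avoids a redundant column, since the dropped grade is implied whenever all remaining dummies are zero.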

In [26]:
feature_cols = (
    ["loan_amnt", "term_months", "int_rate", "annual_inc_clean"]
    + [c for c in df.columns if c.startswith("grade_")]
)

X = df[feature_cols].copy()
y = df["default_flag"].copy()

X.shape, y.shape

((250000, 10), (250000,))
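
Before exporting, it can be worth confirming that `X` and `y` are in a shape a model can consume. The helper below, `check_model_inputs`, is a hypothetical function written for this illustration and is not part of the notebook; the demo inputs are toy values.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_model_inputs(X: pd.DataFrame, y: pd.Series) -> list[str]:
    """Return a list of readable problems; an empty list means none was found."""
    problems = []
    if len(X) != len(y):
        problems.append(f"row mismatch: X has {len(X)} rows, y has {len(y)}")
    # Boolean dummy columns count as numeric for pandas.
    non_numeric = [c for c in X.columns if not is_numeric_dtype(X[c])]
    if non_numeric:
        problems.append(f"non-numeric columns: {non_numeric}")
    with_missing = X.columns[X.isna().any()].tolist()
    if with_missing:
        problems.append(f"columns with missing values: {with_missing}")
    if not set(y.unique()) <= {0, 1}:
        problems.append("target contains values other than 0 and 1")
    return problems


# Toy inputs (illustrative values only).
X_demo = pd.DataFrame({"loan_amnt": [20000.0, 7000.0], "grade_B": [False, True]})
y_demo = pd.Series([0, 1], name="default_flag")
print(check_model_inputs(X_demo, y_demo))
```

Calling `check_model_inputs(X, y)` on the real objects would return an empty list if all four checks pass.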

## Export 

In [27]:
X.to_csv("../data/processed/X_features.csv", index=False)
y.to_csv("../data/processed/y_target.csv", index=False)

print("Exports OK:")
print("- data/processed/X_features.csv")
print("- data/processed/y_target.csv")


Exports OK:
- data/processed/X_features.csv
- data/processed/y_target.csv
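
A CSV round trip does not store dtypes, so reloading the exported files and comparing them with the in-memory objects is a cheap safeguard. The sketch below writes to a temporary directory rather than the project folders and uses hypothetical `X_demo` and `y_demo` objects with toy values.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-ins for X and y (illustrative values only).
X_demo = pd.DataFrame({"loan_amnt": [20000.0, 7000.0], "grade_B": [False, True]})
y_demo = pd.Series([0, 1], name="default_flag")

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "processed"
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing

    X_demo.to_csv(out_dir / "X_features.csv", index=False)
    y_demo.to_csv(out_dir / "y_target.csv", index=False)

    # Reload and compare with what was written.
    X_back = pd.read_csv(out_dir / "X_features.csv")
    y_back = pd.read_csv(out_dir / "y_target.csv")["default_flag"]

roundtrip_ok = X_back.equals(X_demo) and y_back.equals(y_demo)
print("round trip OK:", roundtrip_ok)
```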


In [28]:
feature_list = pd.DataFrame({"feature": feature_cols})
feature_list.to_csv("../reports/tableau_exports/04_feature_list.csv", index=False)
feature_list.head(20)


| | feature |
|---|---|
| 0 | loan_amnt |
| 1 | term_months |
| 2 | int_rate |
| 3 | annual_inc_clean |
| 4 | grade_B |
| 5 | grade_C |
| 6 | grade_D |
| 7 | grade_E |
| 8 | grade_F |
| 9 | grade_G |


In [29]:
X

| | loan_amnt | term_months | int_rate | annual_inc_clean | grade_B | grade_C | grade_D | grade_E | grade_F | grade_G |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20000.0 | 36 | 13.99 | 65000.0 | False | True | False | False | False | False |
| 1 | 7000.0 | 36 | 9.16 | 35000.0 | True | False | False | False | False | False |
| 2 | 20000.0 | 36 | 8.67 | 90000.0 | True | False | False | False | False | False |
| 3 | 16000.0 | 36 | 14.46 | 50000.0 | False | True | False | False | False | False |
| 4 | 4000.0 | 36 | 11.53 | 85000.0 | True | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 249995 | 11700.0 | 36 | 12.12 | 55380.0 | True | False | False | False | False | False |
| 249996 | 10000.0 | 36 | 10.91 | 49032.0 | True | False | False | False | False | False |
| 249997 | 25000.0 | 36 | 11.99 | 100000.0 | True | False | False | False | False | False |
| 249998 | 27200.0 | 60 | 18.99 | 61842.0 | False | False | False | True | False | False |


In [30]:
y

0         0
1         0
2         0
3         0
4         0
         ..
249995    0
249996    0
249997    0
249998    0
249999    0
Name: default_flag, Length: 250000, dtype: int64