# Phase 2: Data Quality Check

## Objective
Assess the quality of the LendingClub file:
- missing values
- formats of the key variables
- simple outliers




In [2]:
import pandas as pd

df = pd.read_csv("/Users/master/Downloads/credit-climate-risk-lab/data/raw/lendingclub.csv", low_memory=False)

In [13]:
df.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,68407277,,3600.0,3600.0,3600.0,36 months,13.99,123.03,C,C4,...,,,Cash,N,,,,,,
1,68355089,,24700.0,24700.0,24700.0,36 months,11.99,820.28,C,C1,...,,,Cash,N,,,,,,
2,68341763,,20000.0,20000.0,20000.0,60 months,10.78,432.66,B,B4,...,,,Cash,N,,,,,,
3,66310712,,35000.0,35000.0,35000.0,60 months,14.85,829.9,C,C5,...,,,Cash,N,,,,,,
4,68476807,,10400.0,10400.0,10400.0,60 months,22.45,289.91,F,F1,...,,,Cash,N,,,,,,


In [8]:
# Missing-value analysis: share of missing values per column
missing = (
    df.isna().mean().reset_index()
)
missing.columns = ["column", "missing_rate"]
missing = missing.sort_values("missing_rate", ascending=False)
missing.head(15)


Unnamed: 0,column,missing_rate
1,member_id,1.0
140,orig_projected_additional_accrued_interest,0.996173
135,hardship_end_date,0.995171
134,hardship_start_date,0.995171
129,hardship_type,0.995171
130,hardship_reason,0.995171
131,hardship_status,0.995171
132,deferral_term,0.995171
142,hardship_last_payment_amount,0.995171
141,hardship_payoff_balance_amount,0.995171


In [9]:
missing.to_csv(
    "/Users/master/Downloads/credit-climate-risk-lab/reports/tableau_exports/02_missing_rate_by_column.csv",
    index=False
)


In [12]:
# Missing-value analysis restricted to the business-critical columns
important_columns = ["loan_status", "loan_amnt", "term", "int_rate", "annual_inc", "grade", "addr_state"]
df[important_columns].isna().mean().reset_index()

Unnamed: 0,index,0
0,loan_status,1.5e-05
1,loan_amnt,1.5e-05
2,term,1.5e-05
3,int_rate,1.5e-05
4,annual_inc,1.6e-05
5,grade,1.5e-05
6,addr_state,1.5e-05


### Observation: missing values (business columns)
___


The key columns for this project (loan_status, loan_amnt, term, int_rate, annual_inc, grade, addr_state)
have a missing-value rate below 0.01% (about 1.5e-05, i.e. roughly 0.0015%).

This level is considered negligible.
No corrective action is taken at this stage.
The choice between imputation and row removal will be made later,
when the data is prepared for modeling.
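
The near-identical rates across the key columns hint that the same few rows might be empty in all of them, but this notebook does not verify that. The sketch below shows one way such a check could be done. It runs on a small made-up frame (`sample` and its values are purely illustrative), not on the real LendingClub data.

```python
import pandas as pd

# Toy frame standing in for df[important_columns]; the values are made up.
sample = pd.DataFrame({
    "loan_status": ["Fully Paid", None, "Charged Off", "Current"],
    "loan_amnt": [3600.0, None, 20000.0, 10400.0],
    "annual_inc": [55000.0, None, None, 104433.0],
})

# Number of key columns that are missing in each row.
missing_per_row = sample.isna().sum(axis=1)

# Rows where every key column is missing (candidates for non-data rows).
fully_empty = missing_per_row == sample.shape[1]

# Rows where only some of the key columns are missing.
partially_missing = (missing_per_row > 0) & ~fully_empty

print("fully empty rows:", int(fully_empty.sum()))
print("partially missing rows:", int(partially_missing.sum()))
```

If most missing values turn out to sit in fully empty rows, dropping those rows would probably be simpler than imputing; that decision is still deferred to the modeling phase.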


In [15]:
df["term"].head()



0     36 months
1     36 months
2     60 months
3     60 months
4     60 months
Name: term, dtype: object

### Observation: term (loan duration)
___
The `term` variable is stored as text (e.g. "36 months", "60 months").
To be used in a model, it will need to be converted to a numeric value
(number of months).
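
As a minimal sketch of that conversion, assuming every non-missing value contains a single run of digits followed by "months", the number can be extracted with a regular expression. The column name `term_months` is a hypothetical choice, and the sample series only mimics the values displayed above.

```python
import pandas as pd

# Sample values mimicking df["term"]; the None stands in for a missing entry.
term = pd.Series([" 36 months", " 60 months", None], name="term")

# Pull out the digits, convert them to numbers, then cast to a nullable
# integer dtype so that missing values are preserved as <NA>.
digits = term.str.extract(r"(\d+)", expand=False)
term_months = (
    pd.to_numeric(digits, errors="coerce")
    .astype("Int64")
    .rename("term_months")
)

print(term_months)
```

The regular expression ignores surrounding whitespace, so the result does not depend on whether the raw strings carry a leading space.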

In [16]:
df["int_rate"].head()

0    13.99
1    11.99
2    10.78
3    14.85
4    22.45
Name: int_rate, dtype: float64

### Observation: int_rate (interest rate)

The `int_rate` variable is already stored in numeric form (dtype `float64`), so no format conversion is needed.
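
Rather than reading the dtype off the printed output, the format check can be made explicit. The sketch below reuses the five values displayed above; the 0 to 100 bounds are an assumption (rates expressed in percent), not something stated in the dataset documentation.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# The first five values shown by df["int_rate"].head().
int_rate = pd.Series([13.99, 11.99, 10.78, 14.85, 22.45], name="int_rate")

# Format check: is the column numeric at all?
is_numeric = is_numeric_dtype(int_rate)

# Plausibility check, assuming rates are percentages between 0 and 100.
in_range = bool(int_rate.dropna().between(0, 100).all())

print(is_numeric, in_range)
```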



In [17]:
# Annual income analysis
df["annual_inc"].describe()

count    2.260664e+06
mean     7.799243e+04
std      1.126962e+05
min      0.000000e+00
25%      4.600000e+04
50%      6.500000e+04
75%      9.300000e+04
max      1.100000e+08
Name: annual_inc, dtype: float64

### Observation: annual_inc (annual income)
___
The `annual_inc` variable has a plausible distribution for most observations
(median around 65,000 USD), but it also contains unrealistic extreme values
(min = 0, max = 110 million).

These values are treated as potential outliers.
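
One simple way to flag such values is the interquartile-range (IQR) rule, combined with an explicit check for zero incomes. This is only a sketch: the 1.5 × IQR multiplier is a common convention rather than a choice made in this project, and the five sample values are the min, quartiles, and max reported by `describe()` above, not the full column.

```python
import pandas as pd

# min, 25%, 50%, 75% and max of annual_inc, taken from the describe() output.
annual_inc = pd.Series(
    [0.0, 46000.0, 65000.0, 93000.0, 110000000.0], name="annual_inc"
)

# Interquartile range and the conventional upper fence (Q3 + 1.5 * IQR).
q1 = annual_inc.quantile(0.25)
q3 = annual_inc.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)

# Flag suspicious rows without modifying or dropping anything.
flags = pd.DataFrame({
    "annual_inc": annual_inc,
    "is_zero": annual_inc == 0,
    "above_upper_fence": annual_inc > upper_fence,
})

print(upper_fence)
print(flags)
```

Zero incomes need their own flag because the lower fence of the IQR rule is negative here, so the rule alone would not catch them.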

In [18]:
quality_summary = pd.DataFrame({
    "check": [
        "Missing values",
        "Outliers in annual_inc",
        "Non-numeric term",
        "Interest rate format"
    ],
    "observation": [
        "Negligible on key columns",
        "Extreme values detected (0 and very high values)",
        "Stored as text (e.g. '36 months')",
        "Already numeric"
    ]
})

# Export the summary table to CSV
quality_summary.to_csv(
    "/Users/master/Downloads/credit-climate-risk-lab/reports/tableau_exports/02_data_quality_summary.csv",
    index=False
)

In [19]:
quality_summary

Unnamed: 0,check,observation
0,Missing values,Negligible on key columns
1,Outliers in annual_inc,Extreme values detected (0 and very high values)
2,Non-numeric term,Stored as text (e.g. '36 months')
3,Interest rate format,Already numeric
