# TP1 - Clean: Data Cleaning

This notebook demonstrates how to use the FastAPI-based API for data cleaning.

## Objectives

1. Generate a dataset with defects (missing values, duplicates, outliers)
2. Obtain a quality report
3. Fit a cleaning pipeline
4. Apply the cleaning
5. Compare before and after

In [None]:
import requests
import pandas as pd
import json

# API configuration
BASE_URL = "http://localhost:8000"

print("✅ Imports succeeded")


## Step 1: Generate a dataset with defects

Use the dataset generation endpoint (`/dataset/generate`, called in the cell below) with `phase` set to `"clean"`.

In [None]:
# Generate the dataset
response = requests.post(
    f"{BASE_URL}/dataset/generate",
    json={
        "phase": "clean",
        "seed": 42,
        "n": 1000
    }
)

data = response.json()
dataset_id = data["meta"]["dataset_id"]

print(f"Dataset ID: {dataset_id}")
print(f"Number of rows: {data['result']['n_rows']}")
print(f"Columns: {data['result']['columns']}")

# Display a sample
df_sample = pd.DataFrame(data['result']['data_sample'])
df_sample.head(10)
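
The generator behind the API is not shown in this notebook. As a rough mental model, a seeded generator that injects the three defect types might look like the sketch below. The function name `make_dirty_rows`, the column names, and the defect rates are all hypothetical, and the real endpoint may count or inject defects differently.

```python
import random

def make_dirty_rows(n, seed, missing_rate=0.1, outlier_rate=0.02, dup_rate=0.05):
    """Hypothetical sketch: build n rows, injecting defects along the way."""
    rng = random.Random(seed)  # a fixed seed makes the output reproducible
    rows = []
    for _ in range(n):
        row = {"age": rng.randint(18, 80),
               "income": round(rng.gauss(40000, 8000), 2)}
        if rng.random() < missing_rate:
            row["income"] = None                # missing value
        elif rng.random() < outlier_rate:
            row["income"] = row["income"] * 50  # extreme value (outlier)
        rows.append(row)
    # Duplicates: append copies of randomly chosen original rows.
    originals = list(rows)
    n_dups = round(n * dup_rate)
    rows.extend(dict(rng.choice(originals)) for _ in range(n_dups))
    return rows

rows = make_dirty_rows(n=1000, seed=42)
print(len(rows))  # 1000 generated rows plus 50 duplicated rows
```

Calling the function twice with the same seed returns identical rows, which is the property the `seed` field in the request above is presumably meant to provide.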

## Step 2: Obtain a quality report (before cleaning)

Use the quality report endpoint (`/clean/report/{dataset_id}`, called in the cell below) to analyze the defects.

In [None]:
# Fetch the report
response = requests.get(f"{BASE_URL}/clean/report/{dataset_id}")
report_data = response.json()

report = report_data["report"]

print("📊 Quality report - BEFORE cleaning\n")
print(f"Total number of rows: {report['n_rows']}")
print(f"Number of duplicates: {report['duplicates']}")
print("\nMissing values per column:")
for col, stats in report['missing_values'].items():
    print(f"  {col}: {stats['count']} ({stats['rate']*100:.1f}%)")

print("\nOutliers per column:")
for col, stats in report['outliers'].items():
    print(f"  {col}: {stats['count']} ({stats['rate']*100:.1f}%)")

print("\nData types:")
for col, dtype in report['data_types'].items():
    print(f"  {col}: {dtype}")
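
To make the report fields more concrete, here is a minimal local sketch that computes the same kinds of statistics (`n_rows`, `duplicates`, `missing_values`, `outliers`) on a list of dictionaries. The function `quality_report` is hypothetical, and the 1.5 × IQR outlier rule is an assumption: the notebook does not say which rule the API uses.

```python
from statistics import quantiles

def quality_report(rows, numeric_cols):
    """Sketch of a quality report: duplicates, missing values, IQR outliers."""
    n = len(rows)
    # A row is a duplicate if an identical row was already seen.
    seen, duplicates = set(), 0
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates += 1
        seen.add(key)
    report = {"n_rows": n, "duplicates": duplicates,
              "missing_values": {}, "outliers": {}}
    for col in numeric_cols:
        values = [r[col] for r in rows if r[col] is not None]
        missing = n - len(values)
        report["missing_values"][col] = {"count": missing, "rate": missing / n}
        out = 0
        if len(values) >= 2:  # quantiles() needs at least two data points
            q1, _, q3 = quantiles(values, n=4)
            iqr = q3 - q1
            low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # assumed outlier rule
            out = sum(1 for v in values if v < low or v > high)
        report["outliers"][col] = {"count": out, "rate": out / n}
    return report

sample = [{"x": v} for v in (10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0)]
sample += [{"x": None}, {"x": 500.0}, {"x": 10.0}]  # missing, outlier, duplicate
report_local = quality_report(sample, ["x"])
print(report_local["duplicates"], report_local["missing_values"]["x"]["count"],
      report_local["outliers"]["x"]["count"])
```

The returned dictionary uses the same `count` and `rate` keys that the cell above reads from the API response, so the same printing loops would work on it.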

## Step 3: Fit a cleaning pipeline

Define the cleaning strategies and fit the pipeline.

In [None]:
# Fit the pipeline
response = requests.post(
    f"{BASE_URL}/clean/fit",
    json={
        "meta": {
            "dataset_id": dataset_id
        },
        "params": {
            "impute_strategy": "mean",
            "outlier_strategy": "clip",
            "categorical_strategy": "one_hot"
        }
    }
)

fit_data = response.json()
cleaner_id = fit_data["result"]["cleaner_id"]

print(f"✅ Cleaning pipeline created: {cleaner_id}")
print("\nRules learned:")
print(f"  Impute values: {fit_data['report']['rules_learned']['impute_values_count']} columns")
print(f"  Outlier bounds: {fit_data['report']['rules_learned']['outlier_bounds_count']} columns")
print(f"  Categorical mappings: {fit_data['report']['rules_learned']['categorical_mappings_count']} columns")
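
Fitting means learning rules from the data without changing it yet. The sketch below illustrates what the three strategies requested above (`mean`, `clip`, `one_hot`) could learn. The function `fit_cleaner`, the shape of the `rules` dictionary, and the 1.5 × IQR clipping bounds are assumptions for illustration, not the API's actual implementation.

```python
from statistics import mean, quantiles

def fit_cleaner(rows, numeric_cols, categorical_cols):
    """Sketch of the 'fit' step: learn rules, leave the data untouched."""
    rules = {"impute_values": {}, "outlier_bounds": {}, "categorical_mappings": {}}
    for col in numeric_cols:
        values = [r[col] for r in rows if r[col] is not None]
        # impute_strategy="mean": remember the column mean.
        rules["impute_values"][col] = mean(values)
        # outlier_strategy="clip": remember bounds (assumed 1.5 * IQR fences).
        q1, _, q3 = quantiles(values, n=4)
        iqr = q3 - q1
        rules["outlier_bounds"][col] = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    for col in categorical_cols:
        # categorical_strategy="one_hot": remember the known categories.
        rules["categorical_mappings"][col] = sorted(
            {r[col] for r in rows if r[col] is not None}
        )
    return rules

train = [
    {"x": 1.0, "c": "a"}, {"x": 2.0, "c": "b"}, {"x": 3.0, "c": "a"},
    {"x": None, "c": "b"}, {"x": 4.0, "c": None},
]
rules = fit_cleaner(train, numeric_cols=["x"], categorical_cols=["c"])
print(rules)
```

Keeping the learned rules separate from the data is what allows the same cleaner to be applied later to another dataset, which is why the API returns a `cleaner_id`.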

## Step 4: Apply the cleaning

Transform the data with the fitted pipeline.

In [None]:
# Apply the transformation
response = requests.post(
    f"{BASE_URL}/clean/transform",
    json={
        "meta": {
            "dataset_id": dataset_id
        },
        "params": {
            "cleaner_id": cleaner_id
        }
    }
)

transform_data = response.json()
cleaned_dataset_id = transform_data["result"]["processed_dataset_id"]

print(f"✅ Cleaning applied: {cleaned_dataset_id}")
print("\n📊 Cleaning counters:")
counters = transform_data["report"]["counters"]
print(f"  Rows before: {counters['rows_before']}")
print(f"  Rows after: {counters['rows_after']}")
print(f"  Duplicates removed: {counters['duplicates_removed']}")
print("\n  Missing values imputed:")
for col, count in counters['missing_imputed'].items():
    print(f"    {col}: {count}")
print("\n  Outliers treated:")
for col, count in counters['outliers_treated'].items():
    print(f"    {col}: {count}")

# Display a cleaned sample
df_clean = pd.DataFrame(transform_data['result']['data_sample'])
print("\n✅ Cleaned data sample:")
df_clean.head(10)
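
The counters above can be understood through a small local sketch of a transform step. It assumes a hand-written `rules` dictionary (hypothetical shape) and an assumed order of operations: drop duplicates, impute, clip, then one-hot encode. The API may order or count these steps differently.

```python
def transform(rows, rules):
    """Sketch of the 'transform' step: apply previously learned rules."""
    counters = {"rows_before": len(rows),
                "missing_imputed": {}, "outliers_treated": {}}
    # 1. Drop exact duplicates, keeping the first occurrence.
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))  # copy so the input is not mutated
    counters["duplicates_removed"] = len(rows) - len(unique)
    for row in unique:
        # 2. Replace missing values with the learned fill value.
        for col, fill in rules["impute_values"].items():
            if row[col] is None:
                row[col] = fill
                counters["missing_imputed"][col] = \
                    counters["missing_imputed"].get(col, 0) + 1
        # 3. Clip values to the learned bounds.
        for col, (low, high) in rules["outlier_bounds"].items():
            clipped = min(max(row[col], low), high)
            if clipped != row[col]:
                row[col] = clipped
                counters["outliers_treated"][col] = \
                    counters["outliers_treated"].get(col, 0) + 1
        # 4. One-hot encode: one 0/1 column per known category.
        for col, categories in rules["categorical_mappings"].items():
            value = row.pop(col)
            for cat in categories:
                row[f"{col}_{cat}"] = int(value == cat)
    counters["rows_after"] = len(unique)
    return unique, counters

example_rules = {"impute_values": {"x": 2.5},
                 "outlier_bounds": {"x": (-2.5, 7.5)},
                 "categorical_mappings": {"c": ["a", "b"]}}
dirty = [{"x": 1.0, "c": "a"}, {"x": None, "c": "b"},
         {"x": 100.0, "c": "a"}, {"x": 1.0, "c": "a"}]
cleaned, local_counters = transform(dirty, example_rules)
print(cleaned)
print(local_counters)
```

In this sketch the original categorical column disappears after encoding, which is one reason a column present in the "before" report may be absent from the "after" report.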

## Step 5: Before/after comparison

Review the improvement in data quality.

In [None]:
print("📊 Report BEFORE vs AFTER\n")
print("="*50)

report_before = fit_data['report']['quality_before']
report_after = transform_data['report']['report_after']

print(f"Rows:  {report_before['n_rows']} → {report_after['n_rows']}")
print(f"Duplicates: {report_before['duplicates']} → {report_after['duplicates']}")

print("\nMissing values (rate):")
for col in report_before['missing_values'].keys():
    if col in report_after['missing_values']:
        before = report_before['missing_values'][col]['rate'] * 100
        after = report_after['missing_values'][col]['rate'] * 100
        print(f"  {col}: {before:.1f}% → {after:.1f}%")

print("\n✅ Cleaning completed successfully!")
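
The comparison logic can also be packaged as a small reusable function. The sketch below is one possible way to do it, assuming report dictionaries shaped like the ones read above (`missing_values` mapping each column to `count` and `rate`). The helper name `compare_missing_rates` and the sample reports are hypothetical.

```python
def compare_missing_rates(before, after):
    """Return one formatted line per column present in both reports."""
    lines = []
    for col, stats in before["missing_values"].items():
        if col not in after["missing_values"]:
            continue  # column no longer exists after cleaning, skip it
        b = stats["rate"] * 100
        a = after["missing_values"][col]["rate"] * 100
        lines.append(f"{col}: {b:.1f}% -> {a:.1f}% ({a - b:+.1f} pts)")
    return lines

# Hypothetical reports, for illustration only.
sample_before = {"missing_values": {"age": {"count": 50, "rate": 0.05},
                                    "city": {"count": 20, "rate": 0.02}}}
sample_after = {"missing_values": {"age": {"count": 0, "rate": 0.0}}}

for line in compare_missing_rates(sample_before, sample_after):
    print(line)  # age: 5.0% -> 0.0% (-5.0 pts)
```

Returning the lines instead of printing them directly makes the comparison easy to reuse, for example to write it to a file.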