# TextCleaner Demonstration - Air Paradis

A simple, direct demonstration of the transformations `TextCleaner` applies through `advanced_preprocess()`.

**Goal**: Visualize the effect of each preprocessing parameter on Air Paradis tweets.

## Section 1: Setup

Imports and loading of TextCleaner.

In [1]:
import sys
import os

# Add the src directory to the path
sys.path.append(os.path.join(os.path.dirname(os.getcwd()), 'src'))

from preprocessing.text_cleaner import TextCleaner

# Create a TextCleaner instance
cleaner = TextCleaner()

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/anthonythevenin/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Definition of the sample Air Paradis tweets, which cover a range of features.

In [2]:
# Sample tweets with contractions, negations, and emotions
tweets = [
    "@AirParadis Can't believe they didn't help! Service was NOT good at all!!!",
    "I'm SOOO disappointed!!! Won't recommend @AirParadis. TERRIBLE experience!!!",
    "Flight wasn't bad but crew didn't care. Can't say I'm happy.",
    "@AirParadis LOOOOVE your service!!! Can't say enough good things! AMAZING!!!",
    "I'm thrilled! You're the BEST airline ever! Won't fly with anyone else!!!",
    "Great flight! Crew was professional, seats comfortable. Highly recommend!!!",
    "NOT happy at all. They didn't honor my booking. So RUUUDE!!!",
    "Service wasn't great. Food was terrible. Wouldn't fly again.",
    "Can't believe how GOOD @AirParadis is!!! Not a single complaint! PERFECT!!!",
    "WORST airline EVER!!! Never again! They don't care about customers!!!"
]

print(f"Loaded {len(tweets)} sample tweets.")
print("\nPreview of the original tweets:")
for i, tweet in enumerate(tweets[:3], 1):
    print(f"{i}. {tweet}")

Loaded 10 sample tweets.

Preview of the original tweets:
1. @AirParadis Can't believe they didn't help! Service was NOT good at all!!!
2. I'm SOOO disappointed!!! Won't recommend @AirParadis. TERRIBLE experience!!!
3. Flight wasn't bad but crew didn't care. Can't say I'm happy.


## Section 2: Contraction Expansion

Expands English contractions into their full forms.

**Parameter**: `expand_contractions`, enabled in `advanced_preprocess` (the method is called directly below to isolate this step)

In [3]:
print("="*80)
print("CONTRACTION EXPANSION")
print("="*80)
print()

# Select 3 tweets containing contractions
sample_tweets = [tweets[0], tweets[1], tweets[4]]

for i, tweet in enumerate(sample_tweets, 1):
    # Apply only the expansion (no other transformations)
    expanded = cleaner.expand_contractions(tweet)

    print(f"Tweet {i}:")
    print(f"  BEFORE : {tweet}")
    print(f"  AFTER  : {expanded}")
    print()

CONTRACTION EXPANSION

Tweet 1:
  BEFORE : @AirParadis Can't believe they didn't help! Service was NOT good at all!!!
  AFTER  : @AirParadis cannot believe they did not help! Service was NOT good at all!!!

Tweet 2:
  BEFORE : I'm SOOO disappointed!!! Won't recommend @AirParadis. TERRIBLE experience!!!
  AFTER  : i am SOOO disappointed!!! will not recommend @AirParadis. TERRIBLE experience!!!

Tweet 3:
  BEFORE : I'm thrilled! You're the BEST airline ever! Won't fly with anyone else!!!
  AFTER  : i am thrilled! you are the BEST airline ever! will not fly with anyone else!!!
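To make the mechanism concrete, here is a minimal sketch of dictionary-based contraction expansion. It is not the `TextCleaner` implementation: the `CONTRACTIONS` table and the `expand_contractions_sketch` function are hypothetical names introduced for illustration, and the table holds only a handful of entries chosen to mirror the output above.

```python
import re

# Hypothetical, deliberately tiny mapping; a real table would be much larger.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "didn't": "did not",
    "i'm": "i am",
    "you're": "you are",
}

# Single case-insensitive alternation, anchored on word boundaries.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions_sketch(text: str) -> str:
    """Replace each known contraction with its lowercase expansion."""
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


print(expand_contractions_sketch("I'm thrilled! You're the BEST airline ever!"))
```

Because the lookup key is lowercased, the replacement is always lowercase while the rest of the text keeps its case, which is the same pattern visible in the notebook output.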



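## Section 3: Emotion Handling

Normalization of repetitions and insertion of emotion markers. In the output, a word written in capitals is followed by a `caps` token and a run of exclamation marks becomes `excited`.

**Parameter**: `handle_emotions=True`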

In [4]:
print("="*80)
print("EMOTION HANDLING")
print("="*80)
print()

# Tweets with strong emotions
emotion_tweets = [
    tweets[1],  # SOOO, TERRIBLE, !!!
    tweets[3],  # LOOOOVE, AMAZING, !!!
    tweets[6]   # RUUUDE, !!!
]

for i, tweet in enumerate(emotion_tweets, 1):
    # Apply advanced_preprocess with emotion handling only
    processed = cleaner.advanced_preprocess(
        tweet,
        handle_negations=False,
        handle_emotions=True,
        remove_stopwords=False,
        use_lemmatization=False,
        use_stemming=False
    )

    print(f"Tweet {i}:")
    print(f"  BEFORE : {tweet}")
    print(f"  AFTER  : {processed}")
    print()

EMOTION HANDLING

Tweet 1:
  BEFORE : I'm SOOO disappointed!!! Won't recommend @AirParadis. TERRIBLE experience!!!
  AFTER  : i am sooo caps disappointed excited will not recommend terrible caps experience excited

Tweet 2:
  BEFORE : @AirParadis LOOOOVE your service!!! Can't say enough good things! AMAZING!!!
  AFTER  : loooove caps your service excited can not say enough good things amazing caps excited

Tweet 3:
  BEFORE : NOT happy at all. They didn't honor my booking. So RUUUDE!!!
  AFTER  : not caps happy at all they did not honor my booking so ruuude caps excited
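The marker behaviour seen above can be approximated with two regular expressions. The sketch below is an assumption-laden stand-in, not the code behind `handle_emotions`: the function name `mark_emotions_sketch` is hypothetical, and the rules (two or more `!` become `excited`, words of two or more capitals gain `caps`) were chosen only because they reproduce the displayed tokens on simple inputs.

```python
import re


def mark_emotions_sketch(text: str) -> str:
    """Insert 'excited' and 'caps' marker tokens, then lowercase and strip punctuation."""
    # A run of two or more exclamation marks becomes an 'excited' token.
    text = re.sub(r"!{2,}", " excited ", text)
    # A fully capitalized word (2+ letters) is lowercased and followed by 'caps'.
    text = re.sub(r"\b[A-Z]{2,}\b", lambda m: m.group(0).lower() + " caps", text)
    # Lowercase everything else and drop remaining non-letter characters.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Collapse the whitespace left behind by the substitutions.
    return " ".join(text.split())


print(mark_emotions_sketch("So RUUUDE!!!"))
```

This sketch skips contraction expansion, so it will not match the notebook output on tweets that contain apostrophes.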



## Section 4: Stemming vs Lemmatization

Direct comparison of the two normalization approaches.

**Stemming**: Reduction to the root (SnowballStemmer)  
**Lemmatization**: Canonical form (WordNetLemmatizer)

In [ ]:
print("="*80)
print("STEMMING vs LEMMATIZATION")
print("="*80)
print()

# Tweets used to compare stemming and lemmatization
compare_tweets = [
    tweets[5],  # professional, comfortable, recommend
    tweets[3],  # service, things, amazing
    tweets[7]   # terrible, wouldn't
]

for i, tweet in enumerate(compare_tweets, 1):
    # With stemming
    stemmed = cleaner.advanced_preprocess(
        tweet,
        handle_negations=False,
        handle_emotions=False,
        remove_stopwords=False,
        use_stemming=True,
        use_lemmatization=False
    )

    # With lemmatization
    lemmatized = cleaner.advanced_preprocess(
        tweet,
        handle_negations=False,
        handle_emotions=False,
        remove_stopwords=False,
        use_stemming=False,
        use_lemmatization=True
    )

    print(f"Tweet {i}:")
    print(f"  BEFORE        : {tweet}")
    print(f"  STEMMING      : {stemmed}")
    print(f"  LEMMATIZATION : {lemmatized}")
    print()
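The conceptual difference between the two approaches can be shown without any NLP library. The toy code below is only an illustration under stated assumptions: `toy_stem` strips a fixed list of suffixes and `toy_lemmatize` looks words up in a tiny hand-written table. Neither reproduces SnowballStemmer or WordNetLemmatizer, and all names and table entries are hypothetical.

```python
# Suffixes tried in order; the first match is removed.
SUFFIXES = ("ing", "ed", "ly", "es", "s")

# Hand-written lemma table standing in for a real lexical database.
LEMMAS = {"seats": "seat", "things": "thing", "better": "good"}


def toy_stem(word: str) -> str:
    """Rule-based truncation: may produce strings that are not real words."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str) -> str:
    """Dictionary lookup: returns a real word, or the input if unknown."""
    return LEMMAS.get(word, word)


for word in ["seats", "highly", "amazing", "better"]:
    print(f"{word:10} stem={toy_stem(word):10} lemma={toy_lemmatize(word)}")
```

The stemmer changes any word that fits a rule, even when the result is not a word, whereas the lookup only changes words it knows and can map irregular forms that no suffix rule would catch.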

## Section 5: Stopword Removal

Selective stopword removal (sentiment-bearing words are preserved).

**Parameter**: `remove_stopwords=True`

In [5]:
print("="*80)
print("STOPWORD REMOVAL")
print("="*80)
print()

# Tweets with many stopwords
stopword_tweets = [
    tweets[0],  # "they didn't", "was NOT good at all"
    tweets[5],  # "was professional", "was comfortable"
    tweets[8]   # "how GOOD", "Not a single"
]

for i, tweet in enumerate(stopword_tweets, 1):
    # Without stopword removal
    without_removal = cleaner.advanced_preprocess(
        tweet,
        handle_negations=False,
        handle_emotions=False,
        remove_stopwords=False,
        use_lemmatization=True,
        use_stemming=False
    )

    # With stopword removal
    with_removal = cleaner.advanced_preprocess(
        tweet,
        handle_negations=False,
        handle_emotions=False,
        remove_stopwords=True,
        use_lemmatization=True,
        use_stemming=False
    )

    print(f"Tweet {i}:")
    print(f"  BEFORE               : {tweet}")
    print(f"  WITHOUT REMOVAL      : {without_removal}")
    print(f"  WITH REMOVAL         : {with_removal}")
    print(f"  Reduction: {len(without_removal.split())} → {len(with_removal.split())} words")
    print()

STOPWORD REMOVAL

Tweet 1:
  BEFORE               : @AirParadis Can't believe they didn't help! Service was NOT good at all!!!
  WITHOUT REMOVAL      : can not believe they did not help service wa not good at all
  WITH REMOVAL         : not believe not help service not good
  Reduction: 13 → 7 words

Tweet 2:
  BEFORE               : Great flight! Crew was professional, seats comfortable. Highly recommend!!!
  WITHOUT REMOVAL      : great flight crew wa professional seat comfortable highly recommend
  WITH REMOVAL         : great flight crew professional seat comfortable highly recommend
  Reduction: 9 → 8 words

Tweet 3:
  BEFORE               : Can't believe how GOOD @AirParadis is!!! Not a single complaint! PERFECT!!!
  WITHOUT REMOVAL      : can not believe how good is not a single complaint perfect
  WITH REMOVAL         : not believe good not single complaint perfect
  Reduction: 11 → 7 words
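Selective removal can be sketched as a set difference: start from a stopword list and subtract the words that must survive. The version below is a hypothetical illustration, not the `TextCleaner` code; `STOPWORDS`, `KEEP`, and `remove_stopwords_sketch` are assumed names, and both sets are tiny samples picked to match the tweets above.

```python
# Hypothetical stopword list, kept tiny for the example.
STOPWORDS = {"a", "all", "at", "can", "did", "how", "is", "not", "they", "was"}

# Sentiment-bearing words that must survive even though they are stopwords.
KEEP = {"not", "no", "never"}


def remove_stopwords_sketch(text: str) -> str:
    """Drop stopwords from whitespace-separated text, except protected ones."""
    effective = STOPWORDS - KEEP
    return " ".join(word for word in text.split() if word not in effective)


print(remove_stopwords_sketch("can not believe they did not help service was not good at all"))
```

Keeping `not` matters for sentiment analysis: without it, "not good" and "good" would collapse to the same token sequence.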



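## Section 6: Full Pipeline

All steps combined: emotion handling, selective stopword removal (sentiment-bearing words such as `not` are preserved), and lemmatization.

**Parameters**: `handle_emotions=True`, `remove_stopwords=True`, `use_lemmatization=True`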

In [6]:
print("="*80)
print("FULL PIPELINE")
print("="*80)
print()

# Complex tweets to demonstrate the full pipeline
pipeline_tweets = [
    tweets[0],  # Complex negative
    tweets[1],  # Negative with emotions
    tweets[3],  # Positive with emotions
    tweets[8]   # Positive with negation
]

for i, tweet in enumerate(pipeline_tweets, 1):
    # Full pipeline (handle_negations is left off because it is unnecessary for the LSTM)
    processed = cleaner.advanced_preprocess(
        tweet,
        handle_negations=False,
        handle_emotions=True,
        remove_stopwords=True,
        use_lemmatization=True,
        use_stemming=False
    )

    print(f"Tweet {i}:")
    print(f"  BEFORE : {tweet}")
    print(f"  AFTER  : {processed}")
    print()

FULL PIPELINE

Tweet 1:
  BEFORE : @AirParadis Can't believe they didn't help! Service was NOT good at all!!!
  AFTER  : not believe not help service not cap good excited

Tweet 2:
  BEFORE : I'm SOOO disappointed!!! Won't recommend @AirParadis. TERRIBLE experience!!!
  AFTER  : sooo cap disappointed excited not recommend terrible cap experience excited

Tweet 3:
  BEFORE : @AirParadis LOOOOVE your service!!! Can't say enough good things! AMAZING!!!
  AFTER  : loooove cap service excited not say enough good thing amazing cap excited

Tweet 4:
  BEFORE : Can't believe how GOOD @AirParadis is!!! Not a single complaint! PERFECT!!!
  AFTER  : not believe good cap excited not single complaint perfect cap excited
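The results above are lowercased yet still carry capital-letter markers, which suggests that capital detection runs before lowercasing. The marker even shows up as `cap`, consistent with lemmatization running after it is inserted. The sketch below illustrates why step order matters in such a pipeline. It is a generic illustration with hypothetical helper names (`mark_caps`, `lowercase`, `run_steps`) and makes no claim about the actual order inside `advanced_preprocess`.

```python
import re


def mark_caps(text: str) -> str:
    """Append a 'caps' token after every fully capitalized word of 2+ letters."""
    return re.sub(r"\b[A-Z]{2,}\b", lambda m: m.group(0) + " caps", text)


def lowercase(text: str) -> str:
    """Lowercase the whole string."""
    return text.lower()


def run_steps(text, steps):
    """Apply each step in sequence, feeding the result into the next one."""
    for step in steps:
        text = step(text)
    return text


# Marker first: the capital information is captured before it is lost.
marker_first = run_steps("NOT happy", [mark_caps, lowercase])
# Lowercase first: nothing is left for the marker step to detect.
lowercase_first = run_steps("NOT happy", [lowercase, mark_caps])

print(marker_first)
print(lowercase_first)
```

Any step that depends on surface features such as case or punctuation has to run before the step that removes them.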

