# 📝 Automatic Text Summarization - Working Version

This notebook implements several automatic summarization techniques to extract the essential information from texts.

## 🎯 Objectives:
- 📊 Extractive summarization with TF-IDF and TextRank
- 🤖 Abstractive summarization with Transformers (if available)
- 📈 Evaluation with ROUGE metrics
- 🔍 Comparison of the different approaches
- 💾 Saving the generated summaries

In [1]:
# 📦 Import core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import os
import re
from collections import Counter
import math
warnings.filterwarnings('ignore')

# Plot configuration
plt.style.use('seaborn-v0_8')
sns.set_palette('viridis')
plt.rcParams['figure.figsize'] = (12, 8)

print('✅ Core libraries imported!')

# Import NLTK for tokenization
nltk_available = False
try:
    import nltk
    from nltk.tokenize import sent_tokenize, word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    # Download the required resources
    nltk.download('punkt', quiet=True)
    # Newer NLTK releases (3.9+) load the sentence tokenizer from 'punkt_tab'
    nltk.download('punkt_tab', quiet=True)
    nltk.download('stopwords', quiet=True)

    nltk_available = True
    print('✅ NLTK available!')
except ImportError:
    print('⚠️ NLTK not available - falling back to alternative methods')

# Import scikit-learn for TF-IDF
sklearn_available = False
try:
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    sklearn_available = True
    print('✅ Scikit-learn available!')
except ImportError:
    print('⚠️ Scikit-learn not available')

# Import advanced libraries (optional)
transformers_available = False
try:
    from transformers import pipeline
    transformers_available = True
    print('🚀 Transformers available!')
except ImportError:
    print('⚠️ Transformers not available - extractive summarization only')

# NetworkX for TextRank (optional)
networkx_available = False
try:
    import networkx as nx
    networkx_available = True
    print('✅ NetworkX available!')
except ImportError:
    print('⚠️ NetworkX not available - falling back to alternative algorithms')

# Create the required folders
os.makedirs('../visualizations', exist_ok=True)
os.makedirs('../data/processed', exist_ok=True)
print('📁 Folders created!')

✅ Core libraries imported!
✅ NLTK available!
✅ Scikit-learn available!
🚀 Transformers available!
✅ NetworkX available!
📁 Folders created!
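
The import cell mentions a fallback to alternative algorithms when NetworkX is missing, without showing one. The block below is a minimal sketch of what such a fallback could look like: a pure-Python TextRank that ranks sentences by power iteration over a word-overlap graph. The function names (`split_sentences`, `overlap_similarity`, `textrank_summary`) and the default parameters are assumptions made for illustration and are not part of the notebook.

```python
import math
import re


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]


def overlap_similarity(a, b):
    """Shared-word count normalized by log vocabulary sizes (TextRank-style)."""
    words_a = set(re.findall(r'\w+', a.lower()))
    words_b = set(re.findall(r'\w+', b.lower()))
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # avoids log(1) = 0 in the denominator
    return len(words_a & words_b) / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank_summary(text, num_sentences=2, damping=0.85, iterations=50):
    """Keep the top-ranked sentences, returned in their original order."""
    sentences = split_sentences(text)
    n = len(sentences)
    if n <= num_sentences:
        return text

    # Weighted, undirected similarity graph stored as a dense matrix
    weights = [[overlap_similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_weight = [sum(row) for row in weights]

    # Weighted PageRank by power iteration
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = sum(weights[j][i] / out_weight[j] * scores[j]
                       for j in range(n) if out_weight[j] > 0)
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores

    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:num_sentences]
    return ' '.join(sentences[i] for i in sorted(top))
```

A sentence that shares no words with the others receives only the baseline score `(1 - damping) / n`, so it is the first to be dropped from the summary.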


## 📂 1. Loading and Preparing the Data

In [2]:
# 📊 Load the data
try:
    df = pd.read_csv('../data/processed/processed_news_data.csv')
    print(f"✅ Data loaded: {df.shape}")
except FileNotFoundError:
    print("❌ File processed_news_data.csv not found.")
    print("🔄 Creating a sample dataset with long texts...")

    # Sample dataset with longer texts for summarization
    sample_texts = [
        """Artificial intelligence has revolutionized the technology industry in unprecedented ways. Companies like OpenAI, Google, and Microsoft are investing billions of dollars in AI research and development. The recent breakthrough in large language models has enabled new applications in natural language processing, computer vision, and robotics. These advances are transforming industries from healthcare to finance, creating new opportunities while also raising important questions about ethics and job displacement. Experts predict that AI will continue to evolve rapidly, with potential applications in autonomous vehicles, personalized medicine, and climate change solutions. However, the development of AI also requires careful consideration of safety, privacy, and fairness to ensure that these powerful technologies benefit all of humanity.""",

        """Climate change represents one of the most pressing challenges of our time, affecting ecosystems, weather patterns, and human societies worldwide. Rising global temperatures have led to melting ice caps, rising sea levels, and more frequent extreme weather events. Scientists from the Intergovernmental Panel on Climate Change have documented clear evidence of human activities contributing to greenhouse gas emissions. The transition to renewable energy sources such as solar, wind, and hydroelectric power is crucial for reducing carbon emissions. Many countries have committed to net-zero emissions targets by 2050, requiring significant changes in energy production, transportation, and industrial processes. Individual actions, such as reducing energy consumption and supporting sustainable practices, also play an important role in addressing this global challenge.""",

        """The global economy has experienced significant volatility in recent years, influenced by factors such as the COVID-19 pandemic, geopolitical tensions, and supply chain disruptions. Central banks worldwide have implemented various monetary policies to stabilize markets and support economic recovery. Inflation rates have fluctuated, affecting consumer purchasing power and business investment decisions. The rise of digital currencies and fintech innovations has transformed traditional banking and payment systems. International trade relationships continue to evolve, with new agreements and partnerships reshaping global commerce. Economists emphasize the importance of sustainable economic growth that balances prosperity with environmental and social considerations for long-term stability.""",

        """Healthcare systems around the world have undergone dramatic transformations, particularly in response to global health challenges. The development and distribution of vaccines have demonstrated the power of international scientific collaboration. Telemedicine and digital health technologies have expanded access to medical care, especially in remote and underserved communities. Precision medicine, powered by genomics and artificial intelligence, is enabling more personalized treatment approaches. Mental health awareness has increased significantly, leading to better support systems and reduced stigma. Healthcare professionals continue to advocate for preventive care and public health measures to improve population health outcomes while managing rising healthcare costs.""",

        """Education has evolved rapidly with the integration of digital technologies and online learning platforms. The pandemic accelerated the adoption of remote learning, highlighting both opportunities and challenges in educational delivery. Students and educators have adapted to new tools and methodologies, from virtual classrooms to interactive learning applications. The concept of lifelong learning has become increasingly important as job markets evolve and new skills are required. Educational institutions are exploring innovative approaches such as competency-based learning, micro-credentials, and partnerships with industry. Access to quality education remains a global priority, with efforts to bridge digital divides and ensure equitable learning opportunities for all students."""
    ]

    df = pd.DataFrame({
        'id': range(1, len(sample_texts) + 1),
        'text': sample_texts,
        'category': ['technology', 'environment', 'economy', 'health', 'education'],
        'sentiment': ['positive', 'neutral', 'negative', 'positive', 'neutral']
    })
    print(f"✅ Sample dataset created: {df.shape}")

# Check that the required column exists
text_column = 'text'
if text_column not in df.columns:
    # Raise instead of calling exit(), which would kill the notebook kernel
    raise KeyError(f"Column '{text_column}' not found in the DataFrame")

# Keep only texts long enough to be summarized
min_length = 200  # Minimum of 200 characters
df_long = df[df[text_column].str.len() >= min_length].copy()

# Fall back to all documents before computing statistics on an empty frame
if len(df_long) == 0:
    print("⚠️ No document is long enough to be summarized")
    df_long = df.copy()  # Use all documents

print(f"📊 Original documents: {len(df)}")
print(f"📝 Documents long enough: {len(df_long)}")
print(f"📏 Average length: {df_long[text_column].str.len().mean():.1f} characters")

# Data preview
print("\n📋 Preview of the texts to summarize:")
for i in range(min(2, len(df_long))):
    text = df_long[text_column].iloc[i]
    print(f"\n📄 Document {i+1} ({len(text)} characters):")
    print(f"   {text[:150]}...")

✅ Data loaded: (10, 9)
📊 Original documents: 10
📝 Documents long enough: 10
📏 Average length: 385.2 characters

📋 Preview of the texts to summarize:

📄 Document 1 (461 characters):
   A major technology company has announced a groundbreaking advancement in artificial intelligence that promises to transform how we interact with machi...

📄 Document 2 (412 characters):
   World leaders have reached a historic agreement at the latest climate change summit, committing to ambitious targets for reducing greenhouse gas emiss...


## 🔧 2. Utility Functions for Summarization

In [3]:
# 🛠️ Utility functions for text processing

def simple_sentence_tokenize(text):
    """Split text into sentences, with a regex fallback if NLTK is unavailable"""
    if nltk_available:
        return sent_tokenize(text)
    else:
        # Simple punctuation-based method; the lookbehind keeps the
        # closing punctuation attached to each sentence
        sentences = re.split(r'(?<=[.!?])\s+', text)
        return [s.strip() for s in sentences if s.strip()]

def simple_word_tokenize(text):
    """Split text into lowercase word tokens"""
    if nltk_available:
        return word_tokenize(text.lower())
    else:
        # Simple regex-based method
        words = re.findall(r'\b\w+\b', text.lower())
        return words

def get_stop_words():
    """Return the set of stop words"""
    if nltk_available:
        try:
            return set(stopwords.words('english'))
        except LookupError:
            # The NLTK stopwords corpus is not installed; use the basic list below
            pass

    # Basic stop-word list
    return {
        'the', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with', 'by',
        'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
        'this', 'that', 'these', 'those', 'a', 'an', 'as', 'if', 'it', 'its',
        'will', 'would', 'could', 'should', 'may', 'might', 'can', 'must'
    }

def calculate_word_frequencies(text):
    """Compute word frequencies normalized by the most frequent word"""
    words = simple_word_tokenize(text)
    stop_words = get_stop_words()

    # Drop stop words and short words
    filtered_words = [word for word in words if word not in stop_words and len(word) > 2]

    # Count occurrences
    word_freq = Counter(filtered_words)

    # Normalize so that the most frequent word has weight 1.0
    max_freq = max(word_freq.values()) if word_freq else 1
    for word in word_freq:
        word_freq[word] = word_freq[word] / max_freq

    return word_freq

def score_sentences(sentences, word_freq):
    """Score each sentence as the mean frequency of its known words"""
    sentence_scores = {}

    for sentence in sentences:
        words = simple_word_tokenize(sentence)
        score = 0
        word_count = 0

        for word in words:
            if word in word_freq:
                score += word_freq[word]
                word_count += 1

        if word_count > 0:
            sentence_scores[sentence] = score / word_count
        else:
            sentence_scores[sentence] = 0

    return sentence_scores

print("✅ Utility functions defined!")
print(f"📝 NLTK available: {nltk_available}")
print(f"🔤 Number of stop words: {len(get_stop_words())}")

✅ Utility functions defined!
📝 NLTK available: True
🔤 Number of stop words: 198
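
To make the scoring concrete, here is a small worked example that mirrors `calculate_word_frequencies` and `score_sentences` on a three-sentence text. It is a self-contained sketch: it uses regex tokenization only (the non-NLTK path) and a tiny illustrative stop-word set, and the names `STOP_WORDS`, `tokenize`, `normalized_frequencies` and `average_score` are assumptions that do not appear in the notebook.

```python
import re
from collections import Counter

STOP_WORDS = {'the', 'and', 'are', 'is', 'a'}  # tiny illustrative set, not NLTK's list


def tokenize(text):
    """Lowercase word tokens, same regex as the notebook's fallback tokenizer."""
    return re.findall(r'\b\w+\b', text.lower())


def normalized_frequencies(text):
    """Word counts divided by the count of the most frequent word."""
    words = [w for w in tokenize(text) if w not in STOP_WORDS and len(w) > 2]
    counts = Counter(words)
    max_count = max(counts.values()) if counts else 1
    return {word: count / max_count for word, count in counts.items()}


def average_score(sentence, word_freq):
    """Mean normalized frequency of the sentence's known words."""
    hits = [word_freq[w] for w in tokenize(sentence) if w in word_freq]
    return sum(hits) / len(hits) if hits else 0.0


text = "Cats chase mice. Cats sleep. Dogs bark."
freq = normalized_frequencies(text)  # 'cats' -> 1.0, every other word -> 0.5
scores = {s: average_score(s, freq)
          for s in ["Cats chase mice", "Cats sleep", "Dogs bark"]}
# "Cats sleep" scores (1.0 + 0.5) / 2 = 0.75, ahead of "Cats chase mice" at 2/3
```

Because the score is an average rather than a sum, a short sentence built around a frequent word can outrank a longer, more informative one.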


## 📊 3. Extractive Summarization with Word Frequencies and TF-IDF

In [4]:
# 🎯 Extractive summarization based on word frequencies
def frequency_based_summary(text, num_sentences=3):
    """Extractive summary based on word frequencies"""
    sentences = simple_sentence_tokenize(text)

    # Texts with at most num_sentences sentences are returned unchanged
    # (this yields 0% compression for short documents)
    if len(sentences) <= num_sentences:
        return text

    # Compute word frequencies
    word_freq = calculate_word_frequencies(text)

    # Score the sentences
    sentence_scores = score_sentences(sentences, word_freq)

    # Select the highest-scoring sentences
    best_sentences = {
        sentence for sentence, _ in
        sorted(sentence_scores.items(), key=lambda x: x[1], reverse=True)[:num_sentences]
    }

    # Restore the original order
    summary_sentences = [sentence for sentence in sentences if sentence in best_sentences]

    return ' '.join(summary_sentences)

# TF-IDF summary if scikit-learn is available
def tfidf_based_summary(text, num_sentences=3):
    """Extractive summary based on TF-IDF"""
    if not sklearn_available:
        return frequency_based_summary(text, num_sentences)

    sentences = simple_sentence_tokenize(text)

    # Same early return as above for short texts
    if len(sentences) <= num_sentences:
        return text

    try:
        # TF-IDF vectorization, one row per sentence
        vectorizer = TfidfVectorizer(stop_words='english', max_features=100)
        tfidf_matrix = vectorizer.fit_transform(sentences)

        # Sentence score = sum of its TF-IDF weights
        sentence_scores = np.asarray(tfidf_matrix.sum(axis=1)).ravel()

        # Select the highest-scoring sentences
        top_indices = sentence_scores.argsort()[-num_sentences:][::-1]

        # Restore the original order
        summary_sentences = [sentences[i] for i in sorted(top_indices)]

        return ' '.join(summary_sentences)

    except Exception as e:
        print(f"⚠️ TF-IDF error: {e}")
        return frequency_based_summary(text, num_sentences)

print("🎯 EXTRACTIVE SUMMARIZATION")
print("=" * 40)

# Apply extractive summarization
print("🔄 Generating extractive summaries...")

df_long['frequency_summary'] = df_long[text_column].apply(
    lambda x: frequency_based_summary(x, num_sentences=3)
)

df_long['tfidf_summary'] = df_long[text_column].apply(
    lambda x: tfidf_based_summary(x, num_sentences=3)
)

print("✅ Extractive summaries generated!")

# Compute compression statistics
original_lengths = df_long[text_column].str.len()
freq_summary_lengths = df_long['frequency_summary'].str.len()
tfidf_summary_lengths = df_long['tfidf_summary'].str.len()

freq_compression = (1 - freq_summary_lengths / original_lengths) * 100
tfidf_compression = (1 - tfidf_summary_lengths / original_lengths) * 100

print("\n📊 COMPRESSION STATISTICS:")
print(f"  📏 Average original length: {original_lengths.mean():.0f} characters")
print(f"  🔤 Average frequency summary length: {freq_summary_lengths.mean():.0f} characters")
print(f"  📊 Average TF-IDF summary length: {tfidf_summary_lengths.mean():.0f} characters")
print(f"  📉 Frequency compression: {freq_compression.mean():.1f}%")
print(f"  📉 TF-IDF compression: {tfidf_compression.mean():.1f}%")

🎯 EXTRACTIVE SUMMARIZATION
🔄 Generating extractive summaries...
✅ Extractive summaries generated!

📊 COMPRESSION STATISTICS:
  📏 Average original length: 385 characters
  🔤 Average frequency summary length: 385 characters
  📊 Average TF-IDF summary length: 385 characters
  📉 Frequency compression: 0.0%
  📉 TF-IDF compression: 0.0%
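
The 0.0% compression reported above is consistent with the early return in both summary functions: a text with at most `num_sentences` sentences is returned unchanged. With documents averaging 385 characters, that is most likely the case for every row. One possible remedy, sketched below under that assumption, is to derive the sentence budget from each document's sentence count instead of fixing it at 3. The helper names `count_sentences` and `adaptive_num_sentences` and the `ratio` parameter are hypothetical and are not part of the notebook.

```python
import math
import re


def count_sentences(text):
    """Count sentences with a naive split on ., ! and ? (NLTK's sent_tokenize is more robust)."""
    return len([s for s in re.split(r'[.!?]+', text) if s.strip()])


def adaptive_num_sentences(text, ratio=0.5, minimum=1):
    """Return a sentence budget strictly below the sentence count when possible."""
    total = count_sentences(text)
    if total <= 1:
        return total  # nothing to compress
    budget = max(minimum, math.floor(total * ratio))
    return min(budget, total - 1)
```

It could then be used as `frequency_based_summary(text, num_sentences=adaptive_num_sentences(text))`, so that any document with at least two sentences loses at least one of them.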
