# 🌏 Regulatory Change Scraper - Asian Markets

This notebook scrapes stock-market regulatory changes for:
- 🇹🇭 **Thailand** (SEC Thailand)
- 🇸🇬 **Singapore** (MAS - Monetary Authority of Singapore)
- 🇲🇾 **Malaysia** (SC Malaysia)

## 📋 Methodology
- **Selenium** for dynamic (JavaScript-rendered) sites
- **BeautifulSoup** for HTML parsing
- **Pandas** for data analysis
- Export to CSV and Excel

In [None]:
# Core library imports
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import time
import re
import warnings
warnings.filterwarnings('ignore')

# Selenium for dynamic scraping
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import TimeoutException, NoSuchElementException

# BeautifulSoup for parsing
from bs4 import BeautifulSoup

# Requests for simple HTTP requests
import requests

print("✅ Libraries imported successfully!")
print(f"📅 Run date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

## ⚙️ Selenium Driver Configuration

In [None]:
def setup_driver(headless=True):
    """
    Configure the Selenium Chrome driver.
    
    Args:
        headless: If True, run the browser in headless mode (no UI)
    
    Returns:
        driver: Chrome WebDriver instance
    """
    chrome_options = Options()
    
    if headless:
        chrome_options.add_argument('--headless')
    
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--window-size=1920,1080')
    chrome_options.add_argument('--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36')
    
    # Disable notifications
    chrome_options.add_experimental_option('prefs', {
        'profile.default_content_setting_values.notifications': 2
    })
    
    driver = webdriver.Chrome(options=chrome_options)
    driver.implicitly_wait(10)
    
    return driver

print("✅ Driver setup function created")

## üáπüá≠ 1. Scraping - SEC Thailand

In [None]:
def scrape_sec_thailand():
    """
    Scrape regulatory changes from SEC Thailand.
    Source: https://www.sec.or.th/EN/Pages/News_All.aspx
    """
    print("🇹🇭 Scraping SEC Thailand...")
    
    url = "https://www.sec.or.th/EN/Pages/News_All.aspx"
    
    try:
        # Use requests for this page (faster than Selenium)
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
        }
        response = requests.get(url, headers=headers, timeout=15)
        response.raise_for_status()  # Treat HTTP error pages as failures
        soup = BeautifulSoup(response.content, 'html.parser')
        
        regulations = []
        
        # Look for news/regulation entries
        news_items = soup.find_all(['div', 'article', 'li'], class_=re.compile('news|item|article', re.I))
        
        if not news_items:
            # Fallback: scan every link on the page
            news_items = soup.find_all('a', href=True)
        
        for item in news_items[:30]:  # Limit to the first 30 items
            try:
                # Extract the title
                title_elem = item.find(['h2', 'h3', 'h4', 'a']) or item
                title = title_elem.get_text(strip=True)
                
                # Keep only titles that mention a regulatory keyword
                if len(title) < 10 or not any(word in title.lower() for word in 
                    ['regulation', 'rule', 'law', 'act', 'requirement', 'guideline', 'circular']):
                    continue
                
                # Extract the date
                date_text = item.get_text()
                date_match = re.search(r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}|\d{4}[/-]\d{1,2}[/-]\d{1,2}', date_text)
                date = date_match.group(0) if date_match else 'N/A'
                
                # Extract the link
                link_elem = item.find('a', href=True) or (item if item.name == 'a' else None)
                link = link_elem['href'] if link_elem else ''
                if link and not link.startswith('http'):
                    link = 'https://www.sec.or.th' + link
                
                regulations.append({
                    'Country': 'Thailand 🇹🇭',
                    'Authority': 'SEC Thailand',
                    'Title': title,
                    'Date': date,
                    'Link': link,
                    'Scraped_At': datetime.now().strftime('%Y-%m-%d %H:%M')
                })
            except Exception:
                # Skip items that cannot be parsed
                continue
        
        print(f"✅ SEC Thailand: {len(regulations)} regulations found")
        return regulations
        
    except Exception as e:
        print(f"❌ SEC Thailand error: {e}")
        return []

# Test the function
thailand_data = scrape_sec_thailand()
if thailand_data:
    df_thailand = pd.DataFrame(thailand_data)
    print("\n📊 Thailand data preview:")
    display(df_thailand.head())
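
The keyword filter in `scrape_sec_thailand` tests plain substrings, so `'act'` also matches titles containing "impact" or "contactless". One possible refinement, sketched here with the standard `re` module, is to match whole words only. The helper name `is_regulatory_title` is hypothetical and not part of the notebook; the keyword list is the one used in the function.

```python
import re

# Keyword list copied from scrape_sec_thailand
REGULATORY_KEYWORDS = ['regulation', 'rule', 'law', 'act', 'requirement', 'guideline', 'circular']

# \b anchors each keyword to word boundaries; the optional "s" accepts simple plurals
_KEYWORD_RE = re.compile(
    r'\b(?:' + '|'.join(map(re.escape, REGULATORY_KEYWORDS)) + r')s?\b',
    re.IGNORECASE,
)

def is_regulatory_title(title):
    """Hypothetical helper: True if the title contains a keyword as a whole word."""
    return bool(_KEYWORD_RE.search(title))

print(is_regulatory_title("Amendment to the Securities Act"))
print(is_regulatory_title("Market impact of contactless payments"))
```

Whole-word matching is stricter than the original check, so it may drop titles the substring test used to keep; whether that trade-off is right depends on how the regulator phrases its headlines.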

## üá∏üá¨ 2. Scraping - MAS Singapore

In [None]:
def scrape_mas_singapore():
    """
    Scrape regulatory changes from MAS Singapore.
    Site: https://www.mas.gov.sg/news
    """
    print("🇸🇬 Scraping MAS Singapore...")
    
    url = "https://www.mas.gov.sg/news/regulations-and-circulars"
    
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
        }
        response = requests.get(url, headers=headers, timeout=15)
        response.raise_for_status()  # Treat HTTP error pages as failures
        soup = BeautifulSoup(response.content, 'html.parser')
        
        regulations = []
        
        # Look for regulation entries
        items = soup.find_all(['div', 'article', 'li'], class_=re.compile('card|item|news|article', re.I))
        
        if not items:
            items = soup.find_all('a', href=re.compile('regulation|circular|notice', re.I))
        
        for item in items[:30]:
            try:
                # Title
                title_elem = item.find(['h2', 'h3', 'h4', 'span', 'a'])
                if not title_elem:
                    continue
                title = title_elem.get_text(strip=True)
                
                if len(title) < 10:
                    continue
                
                # Date
                date_elem = item.find(['time', 'span'], class_=re.compile('date', re.I))
                if date_elem:
                    date = date_elem.get_text(strip=True)
                else:
                    date_match = re.search(r'\d{1,2}\s+(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\s+\d{4}', item.get_text())
                    date = date_match.group(0) if date_match else 'N/A'
                
                # Link
                link_elem = item.find('a', href=True)
                link = link_elem['href'] if link_elem else ''
                if link and not link.startswith('http'):
                    link = 'https://www.mas.gov.sg' + link
                
                regulations.append({
                    'Country': 'Singapore 🇸🇬',
                    'Authority': 'MAS',
                    'Title': title,
                    'Date': date,
                    'Link': link,
                    'Scraped_At': datetime.now().strftime('%Y-%m-%d %H:%M')
                })
            except Exception:
                # Skip items that cannot be parsed
                continue
        
        print(f"✅ MAS Singapore: {len(regulations)} regulations found")
        return regulations
        
    except Exception as e:
        print(f"❌ MAS Singapore error: {e}")
        return []

# Test the function
singapore_data = scrape_mas_singapore()
if singapore_data:
    df_singapore = pd.DataFrame(singapore_data)
    print("\n📊 Singapore data preview:")
    display(df_singapore.head())
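
The scrapers store `Date` as raw text in whatever format each site uses (for example "12 Mar 2024", "12/03/2024", "2024-03-12", or "N/A"), which makes sorting and filtering by date unreliable. A minimal sketch of one way to normalize these strings follows. The helper `normalize_date` and its format list are assumptions, numeric dates are assumed to be day-first, and month names are assumed to be parsed under an English locale.

```python
from datetime import datetime

# Candidate formats, tried in order. Assumption: numeric dates are day-first.
DATE_FORMATS = ['%Y-%m-%d', '%Y/%m/%d', '%d %b %Y', '%d %B %Y', '%d/%m/%Y', '%d-%m-%Y']

def normalize_date(raw):
    """Hypothetical helper: return an ISO 8601 date string, or None if unparseable."""
    text = (raw or '').strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # Try the next candidate format
    return None

print(normalize_date("12 Mar 2024"))
print(normalize_date("N/A"))
```

Two-digit years, which the scrapers' regular expressions also accept, are deliberately left unparsed here because their century is ambiguous.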

## üá≤üáæ 3. Scraping - SC Malaysia

In [None]:
def scrape_sc_malaysia():
    """
    Scrape regulatory changes from SC Malaysia.
    Site: https://www.sc.com.my
    """
    print("🇲🇾 Scraping SC Malaysia...")
    
    url = "https://www.sc.com.my/regulation/guidelines"
    
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
        }
        response = requests.get(url, headers=headers, timeout=15)
        response.raise_for_status()  # Treat HTTP error pages as failures
        soup = BeautifulSoup(response.content, 'html.parser')
        
        regulations = []
        
        # Look for candidate items
        items = soup.find_all(['div', 'article', 'li', 'tr'], class_=re.compile('item|card|row|news', re.I))
        
        if not items:
            items = soup.find_all('a', href=True)
        
        for item in items[:30]:
            try:
                # Title
                title_elem = item.find(['h2', 'h3', 'h4', 'td', 'span', 'a'])
                if not title_elem:
                    continue
                title = title_elem.get_text(strip=True)
                
                if len(title) < 10:
                    continue
                
                # Date
                date_elem = item.find(['time', 'span', 'td'], class_=re.compile('date', re.I))
                if date_elem:
                    date = date_elem.get_text(strip=True)
                else:
                    date_match = re.search(r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}|\d{4}[/-]\d{1,2}[/-]\d{1,2}', item.get_text())
                    date = date_match.group(0) if date_match else 'N/A'
                
                # Link
                link_elem = item.find('a', href=True)
                link = link_elem['href'] if link_elem else ''
                if link and not link.startswith('http'):
                    link = 'https://www.sc.com.my' + link
                
                regulations.append({
                    'Country': 'Malaysia 🇲🇾',
                    'Authority': 'SC Malaysia',
                    'Title': title,
                    'Date': date,
                    'Link': link,
                    'Scraped_At': datetime.now().strftime('%Y-%m-%d %H:%M')
                })
            except Exception:
                # Skip items that cannot be parsed
                continue
        
        print(f"✅ SC Malaysia: {len(regulations)} regulations found")
        return regulations
        
    except Exception as e:
        print(f"❌ SC Malaysia error: {e}")
        return []

# Test the function
malaysia_data = scrape_sc_malaysia()
if malaysia_data:
    df_malaysia = pd.DataFrame(malaysia_data)
    print("\n📊 Malaysia data preview:")
    display(df_malaysia.head())
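
All three scrapers build absolute URLs by prefixing the site root whenever `href` does not start with `http`. That works for root-relative paths such as `/news/x`, but it produces malformed URLs for page-relative links (`doc.pdf`), protocol-relative links (`//host/...`) and non-HTTP links (`javascript:`, `mailto:`). A sketch using the standard library's `urllib.parse.urljoin` is shown here. The helper name `absolutize` and the example paths are hypothetical.

```python
from urllib.parse import urljoin, urlparse

def absolutize(page_url, href):
    """Hypothetical helper: resolve href against the page URL, or return None."""
    href = (href or '').strip()
    if not href or href.startswith('#'):
        return None  # Empty link or in-page anchor
    absolute = urljoin(page_url, href)
    if urlparse(absolute).scheme not in ('http', 'https'):
        return None  # javascript:, mailto:, and similar
    return absolute

page = "https://www.sc.com.my/regulation/guidelines"
print(absolutize(page, "/api/doc.pdf"))
print(absolutize(page, "doc.pdf"))
print(absolutize(page, "javascript:void(0)"))
```

Passing the page URL rather than the site root matters for page-relative links, which resolve against the directory of the page they appear on.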

## 🔄 4. Scraping with Selenium (for dynamic sites)

In [None]:
def scrape_with_selenium(url, country, authority):
    """
    Generic Selenium scraper for sites that require JavaScript.
    """
    print(f"🔄 Scraping {country} with Selenium...")
    
    driver = None
    try:
        driver = setup_driver(headless=True)
        driver.get(url)
        
        # Wait for the page to load
        time.sleep(3)
        
        # Scroll to trigger lazy-loaded content
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)
        
        # Parse the rendered page
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        
        regulations = []
        
        # Flexible search for article-like elements
        items = soup.find_all(['article', 'div', 'li'], 
                             class_=re.compile('news|item|card|article|regulation', re.I))
        
        for item in items[:25]:
            try:
                title_elem = item.find(['h1', 'h2', 'h3', 'h4', 'a'])
                if not title_elem:
                    continue
                    
                title = title_elem.get_text(strip=True)
                if len(title) < 15:
                    continue
                
                # Date
                date_elem = item.find(['time', 'span'], class_=re.compile('date', re.I))
                date = date_elem.get_text(strip=True) if date_elem else 'N/A'
                
                # Link
                link_elem = item.find('a', href=True)
                link = link_elem['href'] if link_elem else ''
                
                regulations.append({
                    'Country': country,
                    'Authority': authority,
                    'Title': title,
                    'Date': date,
                    'Link': link,
                    'Scraped_At': datetime.now().strftime('%Y-%m-%d %H:%M')
                })
            except Exception:
                # Skip items that cannot be parsed
                continue
        
        print(f"✅ {len(regulations)} items found")
        return regulations
        
    except Exception as e:
        print(f"❌ Error: {e}")
        return []
    finally:
        if driver:
            driver.quit()

print("✅ Selenium function created")

## 📊 5. Data Consolidation and Analysis

In [None]:
# Consolidate all scraped data
all_regulations = []

if thailand_data:
    all_regulations.extend(thailand_data)
if singapore_data:
    all_regulations.extend(singapore_data)
if malaysia_data:
    all_regulations.extend(malaysia_data)

# Build the consolidated DataFrame; explicit columns keep the lookups below
# working even when every scraper returned an empty list
df_all = pd.DataFrame(
    all_regulations,
    columns=['Country', 'Authority', 'Title', 'Date', 'Link', 'Scraped_At']
)

print("="*80)
print("📊 SUMMARY OF COLLECTED DATA")
print("="*80)
print(f"\n📈 Total regulations: {len(df_all)}")
print(f"\n📍 Breakdown by country:")
print(df_all['Country'].value_counts())
print(f"\n🏛️ Breakdown by authority:")
print(df_all['Authority'].value_counts())

print("\n" + "="*80)
print("📋 DATA PREVIEW")
print("="*80)
display(df_all.head(10))

# Dataset info (info() prints its report itself and returns None)
print("\n📊 Dataset info:")
df_all.info()
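
Because the scrapers select `div`, `article` and `li` elements with a broad class pattern, a container and one of its children can both match and yield the same title and link twice. A minimal sketch of removing such duplicates before building the DataFrame follows. The helper `dedupe_records` is hypothetical, and keying on authority, title and link is an assumption about what counts as "the same" entry.

```python
def dedupe_records(records):
    """Hypothetical helper: keep the first record for each (authority, title, link)."""
    seen = set()
    unique = []
    for rec in records:
        key = (
            rec.get('Authority'),
            rec.get('Title', '').strip().lower(),  # Ignore case and edge whitespace
            rec.get('Link'),
        )
        if key in seen:
            continue
        seen.add(key)
        unique.append(rec)
    return unique

sample = [
    {'Authority': 'MAS', 'Title': 'Notice on Risk Management', 'Link': 'https://www.mas.gov.sg/a'},
    {'Authority': 'MAS', 'Title': 'notice on risk management ', 'Link': 'https://www.mas.gov.sg/a'},
    {'Authority': 'MAS', 'Title': 'Another Notice', 'Link': 'https://www.mas.gov.sg/b'},
]
print(len(dedupe_records(sample)))
```

On an existing DataFrame, `df_all.drop_duplicates(subset=['Authority', 'Title', 'Link'])` gives a comparable result, without the case and whitespace folding.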

## 🔍 6. Filtering and Search

In [None]:
def search_regulations(df, keyword, country=None):
    """
    Search regulations by keyword.
    
    Args:
        df: DataFrame of regulations
        keyword: Keyword to search for (interpreted as a regular expression)
        country: Optional country to filter on
    """
    mask = df['Title'].str.contains(keyword, case=False, na=False)
    
    if country:
        mask &= df['Country'].str.contains(country, case=False, na=False)
    
    results = df[mask]
    
    print(f"🔍 Results for '{keyword}': {len(results)} found")
    return results

# Example searches
print("📌 Example 1: Search for 'ESG' (Environmental, Social, Governance)")
esg_results = search_regulations(df_all, 'ESG')
if len(esg_results) > 0:
    display(esg_results[['Country', 'Title', 'Date']].head())

print("\n📌 Example 2: Search for 'disclosure'")
disclosure_results = search_regulations(df_all, 'disclosure')
if len(disclosure_results) > 0:
    display(disclosure_results[['Country', 'Title', 'Date']].head())

print("\n📌 Example 3: Search for 'crypto', 'digital' or 'blockchain'")
crypto_mask = df_all['Title'].str.contains('crypto|digital|blockchain', case=False, na=False)
crypto_results = df_all[crypto_mask]
print(f"🔍 {len(crypto_results)} crypto/digital-related regulations found")
if len(crypto_results) > 0:
    display(crypto_results[['Country', 'Title', 'Date']].head())
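
`search_regulations` passes `keyword` straight to `Series.str.contains`, which treats it as a regular expression by default. That is what makes `'crypto|digital'` work as an alternation, but it also means a keyword containing characters such as `(`, `+` or `.` can fail or match more than intended. A small sketch of building a pattern from literal terms with `re.escape` follows; the helper name `literal_pattern` is hypothetical.

```python
import re

def literal_pattern(terms):
    """Hypothetical helper: regex that matches any of the given terms literally."""
    return '|'.join(re.escape(term) for term in terms)

pattern = literal_pattern(['REIT (S-REIT)', 'crypto'])
print(bool(re.search(pattern, 'Updated rules for REIT (S-REIT) managers', re.IGNORECASE)))
```

The result can be passed as the `keyword` argument, for example `search_regulations(df_all, literal_pattern(['crypto', 'digital']))`.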

## 📊 7. Visualizations

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Style configuration
plt.style.use('dark_background')
sns.set_palette("husl")

# Build the figure
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('📊 Regulatory Change Analysis - Asian Markets', 
             fontsize=16, fontweight='bold', y=1.02)

# 1. Breakdown by country
country_counts = df_all['Country'].value_counts()
axes[0, 0].pie(country_counts.values, labels=country_counts.index, autopct='%1.1f%%',
               colors=['#ff6b6b', '#4ecdc4', '#45b7d1'])
axes[0, 0].set_title('Breakdown by Country', fontweight='bold')

# 2. Count per authority
authority_counts = df_all['Authority'].value_counts()
axes[0, 1].barh(authority_counts.index, authority_counts.values, 
                color=['#ff6b6b', '#4ecdc4', '#45b7d1'])
axes[0, 1].set_xlabel('Number of regulations')
axes[0, 1].set_title('Regulations by Authority', fontweight='bold')
axes[0, 1].grid(alpha=0.3)

# 3. Title length
df_all['Title_Length'] = df_all['Title'].str.len()
axes[1, 0].hist(df_all['Title_Length'], bins=20, color='#26de81', alpha=0.7, edgecolor='white')
axes[1, 0].set_xlabel('Title length (characters)')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Distribution of Title Lengths', fontweight='bold')
axes[1, 0].grid(alpha=0.3)

# 4. Top keywords (lowercase words of 4+ letters)
import collections
all_words = ' '.join(df_all['Title']).lower()
words = re.findall(r'\b[a-z]{4,}\b', all_words)
word_freq = collections.Counter(words)
common_words = dict(word_freq.most_common(10))

axes[1, 1].barh(list(common_words.keys()), list(common_words.values()),
                color='#ffa502')
axes[1, 1].set_xlabel('Frequency')
axes[1, 1].set_title('Top 10 Keywords', fontweight='bold')
axes[1, 1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("✅ Visualizations created successfully")
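
The keyword chart counts every lowercase word of four or more letters, so common function words such as "with" or "from" can crowd out the regulatory terms. A sketch of the same count with a stop-word filter is given here. The helper `top_keywords` is hypothetical and the stop-word list is a short illustrative assumption, not a complete one.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption; extend as needed)
STOP_WORDS = {'with', 'from', 'that', 'this', 'have', 'will', 'into', 'under', 'between', 'their'}

def top_keywords(titles, n=10):
    """Hypothetical helper: most frequent 4+ letter words, excluding stop words."""
    words = re.findall(r'\b[a-z]{4,}\b', ' '.join(titles).lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return counts.most_common(n)

titles = ["Guidelines on disclosure with examples", "Disclosure rules from 2024"]
print(top_keywords(titles, n=3))
```

With the notebook's data, `top_keywords(df_all['Title'])` would return the pairs that feed the bar chart.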

## 💾 8. Data Export

In [None]:
# Export to CSV
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
csv_filename = f'regulatory_changes_asia_{timestamp}.csv'
df_all.to_csv(csv_filename, index=False, encoding='utf-8-sig')
print(f"✅ Data exported to: {csv_filename}")

# Export to Excel (one sheet per market, plus a summary)
excel_filename = f'regulatory_changes_asia_{timestamp}.xlsx'
with pd.ExcelWriter(excel_filename, engine='openpyxl') as writer:
    # Sheet 1: All data
    df_all.to_excel(writer, sheet_name='All Regulations', index=False)
    
    # Sheet 2: Thailand
    df_thailand = df_all[df_all['Country'].str.contains('Thailand')]
    if len(df_thailand) > 0:
        df_thailand.to_excel(writer, sheet_name='Thailand', index=False)
    
    # Sheet 3: Singapore
    df_singapore = df_all[df_all['Country'].str.contains('Singapore')]
    if len(df_singapore) > 0:
        df_singapore.to_excel(writer, sheet_name='Singapore', index=False)
    
    # Sheet 4: Malaysia
    df_malaysia = df_all[df_all['Country'].str.contains('Malaysia')]
    if len(df_malaysia) > 0:
        df_malaysia.to_excel(writer, sheet_name='Malaysia', index=False)
    
    # Sheet 5: Summary
    summary_data = {
        'Metric': [
            'Total regulations',
            'Thailand',
            'Singapore',
            'Malaysia',
            'Scraping date'
        ],
        'Value': [
            len(df_all),
            len(df_thailand),
            len(df_singapore),
            len(df_malaysia),
            datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        ]
    }
    pd.DataFrame(summary_data).to_excel(writer, sheet_name='Summary', index=False)

print(f"✅ Data exported to: {excel_filename}")

print("\n" + "="*80)
print("✅ EXPORT COMPLETE")
print("="*80)
print("📁 Files created:")
print(f"   - {csv_filename}")
print(f"   - {excel_filename}")

## 🔄 9. Full Scraping Function (One-Click)

In [None]:
def scrape_all_markets(export=True):
    """
    Main entry point: scrape all markets in one call.
    
    Args:
        export: If True, automatically export the results
    
    Returns:
        Consolidated DataFrame of all regulations
    """
    print("🚀 Starting full scrape...")
    print("="*80)
    
    all_data = []
    
    # Thailand
    try:
        th_data = scrape_sec_thailand()
        all_data.extend(th_data)
        time.sleep(2)
    except Exception as e:
        print(f"⚠️ Thailand error: {e}")
    
    # Singapore
    try:
        sg_data = scrape_mas_singapore()
        all_data.extend(sg_data)
        time.sleep(2)
    except Exception as e:
        print(f"⚠️ Singapore error: {e}")
    
    # Malaysia
    try:
        my_data = scrape_sc_malaysia()
        all_data.extend(my_data)
    except Exception as e:
        print(f"⚠️ Malaysia error: {e}")
    
    # Build the DataFrame
    df = pd.DataFrame(all_data)
    
    print("="*80)
    print(f"✅ Scraping complete: {len(df)} regulations collected")
    
    # Automatic export
    if export and len(df) > 0:
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        filename = f'asia_regulations_{timestamp}.csv'
        df.to_csv(filename, index=False, encoding='utf-8-sig')
        print(f"💾 Data exported: {filename}")
    
    return df

print("✅ Full scraping function created")
print("💡 Use: df = scrape_all_markets() to run the full scrape")
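
`scrape_all_markets` calls each scraper once, and because every scraper catches its own exceptions and returns an empty list, a transient network failure looks the same as "nothing found". One possible mitigation is to retry on an empty result with exponential backoff. The wrapper `with_retries` and its default values are hypothetical, and an empty result can also be legitimate, so the attempt count should stay small.

```python
import time

def with_retries(scraper, attempts=3, base_delay=2.0):
    """Hypothetical wrapper: call scraper() until it returns a non-empty list."""
    for attempt in range(attempts):
        result = scraper()
        if result:
            return result
        if attempt < attempts - 1:
            # Exponential backoff: base_delay, then 2x, 4x, ...
            time.sleep(base_delay * 2 ** attempt)
    return []
```

Inside `scrape_all_markets`, a call such as `with_retries(scrape_sec_thailand)` could replace the direct call to `scrape_sec_thailand()`.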

## 📝 10. Notes and Possible Improvements

### ✅ Current features:
- Scraping of 3 Asian markets (TH, SG, MY)
- Selenium support for dynamic sites
- CSV and Excel export
- Keyword search
- Visualizations

### 🚀 Possible improvements:
1. **Automatic scheduling**: Use `schedule` to scrape daily
2. **Database**: Store in SQLite/PostgreSQL to keep history
3. **Notifications**: Email/Slack when new regulations are detected
4. **NLP**: Sentiment analysis and entity extraction
5. **More markets**: Hong Kong, Japan, South Korea
6. **REST API**: Expose the data via Flask/FastAPI
7. **Dashboard**: Streamlit or Dash for interactive visualization
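
The database and notification ideas fit together: if each run inserts its records into a table with a uniqueness constraint, the rows that were actually inserted are exactly the regulations not seen before. A minimal sketch with the standard `sqlite3` module follows. The table layout, the helper name `store_new` and the choice of (authority, title, link) as the uniqueness key are assumptions.

```python
import sqlite3

def store_new(conn, records):
    """Hypothetical helper: insert records and return those not seen before."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS regulations (
               country TEXT, authority TEXT, title TEXT,
               date TEXT, link TEXT, scraped_at TEXT,
               UNIQUE (authority, title, link)
           )"""
    )
    new_records = []
    for rec in records:
        cur = conn.execute(
            "INSERT OR IGNORE INTO regulations VALUES (?, ?, ?, ?, ?, ?)",
            (rec['Country'], rec['Authority'], rec['Title'],
             rec['Date'], rec['Link'], rec['Scraped_At']),
        )
        if cur.rowcount == 1:  # 0 means the row already existed and was ignored
            new_records.append(rec)
    conn.commit()
    return new_records

conn = sqlite3.connect(':memory:')  # Use a file path to keep history between runs
record = {'Country': 'Singapore', 'Authority': 'MAS', 'Title': 'Sample notice',
          'Date': 'N/A', 'Link': 'https://www.mas.gov.sg/x', 'Scraped_At': '2024-01-01 09:00'}
print(len(store_new(conn, [record])))
print(len(store_new(conn, [record])))
```

The list returned by `store_new` is what a notification step (email, Slack) would send out.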

### ⚠️ Important notes:
- Respect each site's `robots.txt`
- Add delays between requests (rate limiting)
- Regularly check whether the HTML structure has changed
- Use proxies to avoid IP blocks
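
The first two notes can be checked in code with the standard library's `urllib.robotparser`. The sketch below parses rules from a list of lines so that it runs offline. The sample rules and the `example.org` URLs are illustrative assumptions, not the actual rules of any regulator's site. Against a live site, `RobotFileParser.set_url()` followed by `read()` would fetch the real file instead.

```python
from urllib.robotparser import RobotFileParser

def build_robot_parser(robots_txt_lines):
    """Hypothetical helper: build a parser from robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(robots_txt_lines)
    return parser

# Illustrative rules, not copied from any real site
sample_rules = ["User-agent: *", "Disallow: /private/", "Crawl-delay: 5"]
robots = build_robot_parser(sample_rules)

print(robots.can_fetch("*", "https://example.org/news"))
print(robots.can_fetch("*", "https://example.org/private/report"))
print(robots.crawl_delay("*"))
```

When `crawl_delay` returns a number, it can serve as the argument to `time.sleep` between requests; when it returns `None`, a fixed delay such as the two seconds used in `scrape_all_markets` remains a reasonable default.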