# 01 - Data Setup: Olist E-commerce Database

**Goal**: Build a SQLite database from the Olist dataset for Text-to-SQL.

**Steps**:
1. Download the Olist dataset
2. Explore the data
3. Create the SQLite database (6 tables)
4. Validate with hand-written queries
5. Create the test set (questions → SQL)

**Target schema**:
```
customers ──┐
            ├── orders ── order_items ── products
payments ───┘     │
                  └── reviews
```

In [1]:
# Installations
!pip install pandas sqlalchemy kaggle --quiet

In [2]:
import pandas as pd
import sqlite3
from pathlib import Path
import os

# Paths
DATA_DIR = Path('../data')
DB_PATH = DATA_DIR / 'database' / 'ecommerce.db'

print(f'Data dir: {DATA_DIR.absolute()}')
print(f'DB path: {DB_PATH.absolute()}')

Data dir: /Users/mohamed/Downloads/text-to-sql-ecommerce/notebooks/../data
DB path: /Users/mohamed/Downloads/text-to-sql-ecommerce/notebooks/../data/database/ecommerce.db


## 1. Download the Olist dataset

**Option A**: From Kaggle (requires an API key)
```bash
kaggle datasets download -d olistbr/brazilian-ecommerce
```

**Option B**: Download manually from:
https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce

**Option C**: Use direct URLs (the approach taken here)

In [3]:
# Option C: download from direct URLs (mirror)
import urllib.request

# Create the folder
RAW_DIR = DATA_DIR / 'raw'
RAW_DIR.mkdir(parents=True, exist_ok=True)

# URLs of the Olist files (GitHub mirror)
BASE_URL = 'https://raw.githubusercontent.com/olist/work-at-olist-data/master/datasets/'

FILES = [
    'olist_customers_dataset.csv',
    'olist_orders_dataset.csv',
    'olist_order_items_dataset.csv',
    'olist_products_dataset.csv',
    'olist_order_payments_dataset.csv',
    'olist_order_reviews_dataset.csv'
]

print('Téléchargement des fichiers...')
for f in FILES:
    filepath = RAW_DIR / f
    if not filepath.exists():
        print(f'  Downloading {f}...')
        try:
            urllib.request.urlretrieve(BASE_URL + f, filepath)
            print(f'  ✓ {f}')
        except Exception as e:
            print(f'  ✗ {f}: {e}')
    else:
        print(f'  ✓ {f} (exists)')

print('\nDone!')

Téléchargement des fichiers...
  Downloading olist_customers_dataset.csv...
  ✓ olist_customers_dataset.csv
  Downloading olist_orders_dataset.csv...
  ✓ olist_orders_dataset.csv
  Downloading olist_order_items_dataset.csv...
  ✓ olist_order_items_dataset.csv
  Downloading olist_products_dataset.csv...
  ✓ olist_products_dataset.csv
  Downloading olist_order_payments_dataset.csv...
  ✓ olist_order_payments_dataset.csv
  Downloading olist_order_reviews_dataset.csv...
  ✓ olist_order_reviews_dataset.csv

Done!
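
The download loop skips any file that already exists without looking inside it, so a truncated or unrelated file left by an earlier run would go unnoticed. A minimal sketch of a header check follows; the helper name `missing_columns` and the column lists passed to it are illustrative assumptions, not part of the original notebook.

```python
import csv
import tempfile
from pathlib import Path

def missing_columns(path, required):
    """Return the required column names that are absent from the CSV header."""
    path = Path(path)
    if not path.exists() or path.stat().st_size == 0:
        return list(required)
    # utf-8-sig tolerates an optional byte-order mark at the start of the file
    with open(path, newline='', encoding='utf-8-sig') as f:
        header = next(csv.reader(f), [])
    return [c for c in required if c not in header]

# Demo on a throwaway file so the sketch runs without the real download
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / 'demo.csv'
    demo.write_text('order_id,customer_id\nA1,C1\n', encoding='utf-8')
    demo_ok = missing_columns(demo, ['order_id', 'customer_id'])
    demo_bad = missing_columns(demo, ['order_id', 'order_status'])
    demo_absent = missing_columns(Path(tmp) / 'absent.csv', ['order_id'])

print(demo_ok, demo_bad, demo_absent)
```

In the notebook this could be called as `missing_columns(RAW_DIR / 'olist_orders_dataset.csv', ['order_id', 'customer_id'])` before loading.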


In [None]:
# Fallback: if the download fails, generate synthetic data
# that follows the Olist schema

import numpy as np
from datetime import datetime, timedelta

def create_synthetic_data():
    """Create synthetic data in the style of Olist."""
    np.random.seed(42)
    
    n_customers = 500
    n_products = 100
    n_orders = 1000
    
    # Customers (city and state are drawn together so each pair stays consistent)
    city_state = [('São Paulo', 'SP'), ('Rio de Janeiro', 'RJ'), ('Belo Horizonte', 'MG'),
                  ('Brasília', 'DF'), ('Salvador', 'BA'), ('Curitiba', 'PR')]
    picks = np.random.randint(0, len(city_state), n_customers)
    customers = pd.DataFrame({
        'customer_id': [f'cust_{i:04d}' for i in range(n_customers)],
        'customer_city': [city_state[i][0] for i in picks],
        'customer_state': [city_state[i][1] for i in picks]
    })
    
    # Products
    categories = ['electronics', 'furniture', 'clothing', 'books', 'sports', 'beauty', 'toys', 'food']
    products = pd.DataFrame({
        'product_id': [f'prod_{i:04d}' for i in range(n_products)],
        'product_category': np.random.choice(categories, n_products),
        'product_weight_g': np.random.randint(100, 5000, n_products),
        'product_length_cm': np.random.randint(10, 100, n_products),
        'product_height_cm': np.random.randint(5, 50, n_products),
        'product_width_cm': np.random.randint(5, 50, n_products)
    })
    
    # Orders
    start_date = datetime(2017, 1, 1)
    orders = pd.DataFrame({
        'order_id': [f'order_{i:04d}' for i in range(n_orders)],
        'customer_id': np.random.choice(customers['customer_id'], n_orders),
        'order_status': np.random.choice(['delivered', 'shipped', 'canceled', 'processing'], 
                                         n_orders, p=[0.85, 0.08, 0.02, 0.05]),
        'order_purchase_timestamp': [start_date + timedelta(days=np.random.randint(0, 730)) 
                                     for _ in range(n_orders)],
        'order_delivered_timestamp': None
    })
    orders['order_purchase_timestamp'] = pd.to_datetime(orders['order_purchase_timestamp'])
    orders['order_delivered_timestamp'] = orders.apply(
        lambda x: x['order_purchase_timestamp'] + timedelta(days=np.random.randint(3, 15)) 
                  if x['order_status'] == 'delivered' else None, axis=1
    )
    
    # Order Items (1-3 items per order)
    order_items_list = []
    for order_id in orders['order_id']:
        n_items = np.random.randint(1, 4)
        for i in range(n_items):
            order_items_list.append({
                'order_id': order_id,
                'order_item_id': i + 1,
                'product_id': np.random.choice(products['product_id']),
                'price': round(np.random.uniform(20, 500), 2),
                'freight_value': round(np.random.uniform(5, 50), 2)
            })
    order_items = pd.DataFrame(order_items_list)
    
    # Payments
    payments_list = []
    for order_id in orders['order_id']:
        order_total = order_items[order_items['order_id'] == order_id]['price'].sum()
        payments_list.append({
            'order_id': order_id,
            'payment_sequential': 1,
            'payment_type': np.random.choice(['credit_card', 'boleto', 'debit_card', 'voucher'],
                                             p=[0.7, 0.2, 0.08, 0.02]),
            'payment_installments': np.random.randint(1, 12),
            'payment_value': round(order_total, 2)
        })
    payments = pd.DataFrame(payments_list)
    
    # Reviews
    reviews_list = []
    delivered_orders = orders[orders['order_status'] == 'delivered']['order_id']
    for order_id in delivered_orders:
        if np.random.random() < 0.7:  # 70% of delivered orders get a review
            reviews_list.append({
                'review_id': f'rev_{len(reviews_list):04d}',
                'order_id': order_id,
                'review_score': np.random.choice([1, 2, 3, 4, 5], p=[0.05, 0.05, 0.1, 0.3, 0.5]),
                'review_comment_title': np.random.choice(['Great!', 'Good', 'OK', 'Bad', None]),
                'review_comment_message': np.random.choice(['Fast delivery', 'As expected', 
                                                            'Could be better', 'Loved it!', None])
            })
    reviews = pd.DataFrame(reviews_list)
    
    return {
        'customers': customers,
        'products': products,
        'orders': orders,
        'order_items': order_items,
        'payments': payments,
        'reviews': reviews
    }

print('Fonction de données synthétiques prête.')

In [4]:
# Load the data (Olist or synthetic)

def load_olist_data():
    """Load the Olist data, or fall back to synthetic data."""
    
    # Try to load Olist first
    try:
        customers = pd.read_csv(RAW_DIR / 'olist_customers_dataset.csv')
        orders = pd.read_csv(RAW_DIR / 'olist_orders_dataset.csv')
        order_items = pd.read_csv(RAW_DIR / 'olist_order_items_dataset.csv')
        products = pd.read_csv(RAW_DIR / 'olist_products_dataset.csv')
        payments = pd.read_csv(RAW_DIR / 'olist_order_payments_dataset.csv')
        reviews = pd.read_csv(RAW_DIR / 'olist_order_reviews_dataset.csv')
        
        print('✓ Données Olist chargées!')
        
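        # Keep only the columns we need and rename a few for simplicity.
        # Note: this drops customer_unique_id, so the customers table keeps
        # only the per-order customer_id.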
        customers = customers[['customer_id', 'customer_city', 'customer_state']]
        products = products[['product_id', 'product_category_name', 'product_weight_g',
                            'product_length_cm', 'product_height_cm', 'product_width_cm']]
        products = products.rename(columns={'product_category_name': 'product_category'})
        orders = orders[['order_id', 'customer_id', 'order_status', 
                        'order_purchase_timestamp', 'order_delivered_customer_date']]
        orders = orders.rename(columns={'order_delivered_customer_date': 'order_delivered_timestamp'})
        order_items = order_items[['order_id', 'order_item_id', 'product_id', 'price', 'freight_value']]
        payments = payments[['order_id', 'payment_sequential', 'payment_type', 
                            'payment_installments', 'payment_value']]
        reviews = reviews[['review_id', 'order_id', 'review_score', 
                          'review_comment_title', 'review_comment_message']]
        
        return {
            'customers': customers,
            'products': products,
            'orders': orders,
            'order_items': order_items,
            'payments': payments,
            'reviews': reviews
        }
        
    except Exception as e:
        print(f'Olist non disponible: {e}')
        print('Création de données synthétiques...')
        return create_synthetic_data()

# Load
data = load_olist_data()

print('\nTables chargées:')
for name, df in data.items():
    print(f'  {name}: {len(df)} rows, {len(df.columns)} cols')

✓ Données Olist chargées!

Tables chargées:
  customers: 99441 rows, 3 cols
  products: 32951 rows, 6 cols
  orders: 99441 rows, 5 cols
  order_items: 112650 rows, 5 cols
  payments: 103886 rows, 5 cols
  reviews: 99224 rows, 5 cols


## 2. Explore the data

In [5]:
# Preview of each table
for name, df in data.items():
    print(f'\n{"="*60}')
    print(f'{name.upper()}')
    print('='*60)
    print(f'Shape: {df.shape}')
    print(f'\nColumns: {list(df.columns)}')
    print(f'\nSample:')
    display(df.head(3))


CUSTOMERS
Shape: (99441, 3)

Columns: ['customer_id', 'customer_city', 'customer_state']

Sample:


| | customer_id | customer_city | customer_state |
|---|---|---|---|
| 0 | 06b8999e2fba1a1fbc88172c00ba8bc7 | franca | SP |
| 1 | 18955e83d337fd6b2def6b18a428ac77 | sao bernardo do campo | SP |
| 2 | 4e7b3e00288586ebd08712fdd0374a03 | sao paulo | SP |



PRODUCTS
Shape: (32951, 6)

Columns: ['product_id', 'product_category', 'product_weight_g', 'product_length_cm', 'product_height_cm', 'product_width_cm']

Sample:


| | product_id | product_category | product_weight_g | product_length_cm | product_height_cm | product_width_cm |
|---|---|---|---|---|---|---|
| 0 | 1e9e8ef04dbcff4541ed26657ea517e5 | perfumaria | 225.0 | 16.0 | 10.0 | 14.0 |
| 1 | 3aa071139cb16b67ca9e5dea641aaa2f | artes | 1000.0 | 30.0 | 18.0 | 20.0 |
| 2 | 96bd76ec8810374ed1b65e291975717f | esporte_lazer | 154.0 | 18.0 | 9.0 | 15.0 |



ORDERS
Shape: (99441, 5)

Columns: ['order_id', 'customer_id', 'order_status', 'order_purchase_timestamp', 'order_delivered_timestamp']

Sample:


| | order_id | customer_id | order_status | order_purchase_timestamp | order_delivered_timestamp |
|---|---|---|---|---|---|
| 0 | e481f51cbdc54678b7cc49136f2d6af7 | 9ef432eb6251297304e76186b10a928d | delivered | 2017-10-02 10:56:33 | 2017-10-10 21:25:13 |
| 1 | 53cdb2fc8bc7dce0b6741e2150273451 | b0830fb4747a6c6d20dea0b8c802d7ef | delivered | 2018-07-24 20:41:37 | 2018-08-07 15:27:45 |
| 2 | 47770eb9100c2d0c44946d9cf07ec65d | 41ce2a54c0b03bf3443c3d931a367089 | delivered | 2018-08-08 08:38:49 | 2018-08-17 18:06:29 |



ORDER_ITEMS
Shape: (112650, 5)

Columns: ['order_id', 'order_item_id', 'product_id', 'price', 'freight_value']

Sample:


| | order_id | order_item_id | product_id | price | freight_value |
|---|---|---|---|---|---|
| 0 | 00010242fe8c5a6d1ba2dd792cb16214 | 1 | 4244733e06e7ecb4970a6e2683c13e61 | 58.9 | 13.29 |
| 1 | 00018f77f2f0320c557190d7a144bdd3 | 1 | e5f2d52b802189ee658865ca93d83a8f | 239.9 | 19.93 |
| 2 | 000229ec398224ef6ca0657da4fc703e | 1 | c777355d18b72b67abbeef9df44fd0fd | 199.0 | 17.87 |



PAYMENTS
Shape: (103886, 5)

Columns: ['order_id', 'payment_sequential', 'payment_type', 'payment_installments', 'payment_value']

Sample:


| | order_id | payment_sequential | payment_type | payment_installments | payment_value |
|---|---|---|---|---|---|
| 0 | b81ef226f3fe1789b1e8b2acac839d17 | 1 | credit_card | 8 | 99.33 |
| 1 | a9810da82917af2d9aefd1278f1dcfa0 | 1 | credit_card | 1 | 24.39 |
| 2 | 25e8ea4e93396b6fa0d3dd708e76c1bd | 1 | credit_card | 1 | 65.71 |



REVIEWS
Shape: (99224, 5)

Columns: ['review_id', 'order_id', 'review_score', 'review_comment_title', 'review_comment_message']

Sample:


| | review_id | order_id | review_score | review_comment_title | review_comment_message |
|---|---|---|---|---|---|
| 0 | 7bc2406110b926393aa56f80a40eba40 | 73fc7af87114b39712e6da79b0a377eb | 4 | | |
| 1 | 80e641a11e56f04c1ad469d5645fdfde | a548910a1c6147796b98fdf73dbeba33 | 5 | | |
| 2 | 228ce5500dc1d8e020d8d1322874b6f0 | f9e4b658b201a9f2ecdecbb34bed034b | 5 | | |
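
The review samples above show empty comment fields, which suggests that a three-row preview is not enough to judge data quality. A minimal sketch of a per-table profile is given below; `profile_table` and the toy DataFrame are illustrative assumptions, and the numbers it prints describe the toy data only, not the real Olist tables.

```python
import pandas as pd

def profile_table(df, key):
    """Summarize row count, key uniqueness and null counts for one table."""
    nulls = df.isna().sum()
    return {
        'rows': len(df),
        'unique_keys': int(df[key].nunique()),
        'duplicated_keys': int(df[key].duplicated().sum()),
        # Only report the columns that actually contain nulls
        'null_counts': {c: int(n) for c, n in nulls.items() if n > 0},
    }

# Toy data: one repeated review_id and two missing titles
toy_reviews = pd.DataFrame({
    'review_id': ['r1', 'r2', 'r2'],
    'order_id': ['o1', 'o2', 'o3'],
    'review_comment_title': [None, 'ok', None],
})

report = profile_table(toy_reviews, key='review_id')
print(report)
```

In the notebook the same helper could be applied as `profile_table(data['reviews'], key='review_id')`, and likewise for the other tables with their own key column.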


In [6]:
# Relationship schema
print('SCHÉMA RELATIONNEL')
print('='*60)
print('''
customers (customer_id PK)
    │
    └──< orders (order_id PK, customer_id FK)
            │
            ├──< order_items (order_id FK, product_id FK)
            │         │
            │         └──> products (product_id PK)
            │
            ├──< payments (order_id FK)
            │
            └──< reviews (order_id FK)
''')
print('\nRelations:')
print('  - customers 1:N orders')
print('  - orders 1:N order_items')
print('  - orders 1:N payments')
print('  - orders 1:1 reviews')
print('  - products 1:N order_items')

SCHÉMA RELATIONNEL

customers (customer_id PK)
    │
    └──< orders (order_id PK, customer_id FK)
            │
            ├──< order_items (order_id FK, product_id FK)
            │         │
            │         └──> products (product_id PK)
            │
            ├──< payments (order_id FK)
            │
            └──< reviews (order_id FK)


Relations:
  - customers 1:N orders
  - orders 1:N order_items
  - orders 1:N payments
  - orders 1:1 reviews
  - products 1:N order_items
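
The diagram draws reviews with the same `──<` connector as the 1:N relations, while the list states `orders 1:1 reviews`, so the cardinalities are worth checking against the data rather than assuming. The sketch below shows one way to do that with pandas; `relation_cardinality` and the toy tables are hypothetical, and the result on the real data is not claimed here.

```python
import pandas as pd

def relation_cardinality(parent, child, key):
    """Classify parent -> child as '1:1' or '1:N' and count orphan child rows."""
    per_parent = child.groupby(key).size()
    # Orphans are child rows whose key has no match in the parent table
    orphans = int((~child[key].isin(parent[key])).sum())
    kind = '1:N' if (per_parent > 1).any() else '1:1'
    return kind, orphans

# Toy data: order o1 has two reviews, and one review points to a missing order
toy_orders = pd.DataFrame({'order_id': ['o1', 'o2', 'o3']})
toy_reviews = pd.DataFrame({'order_id': ['o1', 'o1', 'o9']})
toy_payments = pd.DataFrame({'order_id': ['o1', 'o2']})

reviews_check = relation_cardinality(toy_orders, toy_reviews, 'order_id')
payments_check = relation_cardinality(toy_orders, toy_payments, 'order_id')
print(reviews_check, payments_check)
```

On the loaded tables, the call would look like `relation_cardinality(data['orders'], data['reviews'], 'order_id')`.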


## 3. Create the SQLite database

In [7]:
# Create the SQLite database
DB_PATH.parent.mkdir(parents=True, exist_ok=True)

# Delete the file if it already exists
if DB_PATH.exists():
    DB_PATH.unlink()
    print('Base existante supprimée.')

# Open the connection
conn = sqlite3.connect(DB_PATH)
print(f'Base créée: {DB_PATH}')

# Insert the tables
for name, df in data.items():
    df.to_sql(name, conn, index=False, if_exists='replace')
    print(f'  ✓ {name}: {len(df)} rows')

conn.commit()
print('\n✓ Base SQLite créée avec succès!')

Base créée: ../data/database/ecommerce.db
  ✓ customers: 99441 rows
  ✓ products: 32951 rows
  ✓ orders: 99441 rows
  ✓ order_items: 112650 rows
  ✓ payments: 103886 rows
  ✓ reviews: 99224 rows

✓ Base SQLite créée avec succès!


In [8]:
# Check the tables
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = cursor.fetchall()

print('Tables dans la base:')
for t in tables:
    cursor.execute(f"SELECT COUNT(*) FROM {t[0]}")
    count = cursor.fetchone()[0]
    print(f'  {t[0]}: {count} rows')

Tables dans la base:
  customers: 99441 rows
  products: 32951 rows
  orders: 99441 rows
  order_items: 112650 rows
  payments: 103886 rows
  reviews: 99224 rows
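
Tables written with `to_sql(..., index=False)` carry no indexes, so every JOIN in the later queries scans whole tables. A minimal sketch of adding indexes on the join columns follows; the `JOIN_COLUMNS` mapping and the `add_indexes` helper are assumptions made for illustration, and the demo runs on an in-memory database rather than `ecommerce.db`.

```python
import sqlite3

# Hypothetical choice: a subset of the columns used in this notebook's JOINs
JOIN_COLUMNS = {
    'orders': ['order_id', 'customer_id'],
    'order_items': ['order_id', 'product_id'],
}

def add_indexes(conn, join_columns):
    """Create one index per (table, column) pair if it does not exist yet."""
    created = []
    for table, columns in join_columns.items():
        for col in columns:
            name = f'idx_{table}_{col}'
            conn.execute(f'CREATE INDEX IF NOT EXISTS {name} ON {table} ({col})')
            created.append(name)
    conn.commit()
    return created

# Demo on an in-memory database with the same table and column names
demo = sqlite3.connect(':memory:')
demo.execute('CREATE TABLE orders (order_id TEXT, customer_id TEXT)')
demo.execute('CREATE TABLE order_items (order_id TEXT, product_id TEXT)')
index_names = add_indexes(demo, JOIN_COLUMNS)
listed = [row[0] for row in demo.execute(
    "SELECT name FROM sqlite_master WHERE type='index' ORDER BY name")]
print(listed)
```

Because the statement uses `IF NOT EXISTS`, calling the helper twice is harmless.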


In [9]:
# Get the DDL schema
def get_schema(conn):
    """Extract the DDL schema from the database."""
    cursor = conn.cursor()
    cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
    tables = [t[0] for t in cursor.fetchall()]
    
    schema = []
    for table in tables:
        cursor.execute(f"PRAGMA table_info({table})")
        columns = cursor.fetchall()
        
        cols_def = []
        for col in columns:
            col_name = col[1]
            col_type = col[2] or 'TEXT'
            cols_def.append(f"  {col_name} {col_type}")
        
        schema.append(f"CREATE TABLE {table} (\n" + ",\n".join(cols_def) + "\n);")
    
    return "\n\n".join(schema)

schema_ddl = get_schema(conn)
print('SCHÉMA DDL')
print('='*60)
print(schema_ddl)

SCHÉMA DDL
CREATE TABLE customers (
  customer_id TEXT,
  customer_city TEXT,
  customer_state TEXT
);

CREATE TABLE products (
  product_id TEXT,
  product_category TEXT,
  product_weight_g REAL,
  product_length_cm REAL,
  product_height_cm REAL,
  product_width_cm REAL
);

CREATE TABLE orders (
  order_id TEXT,
  customer_id TEXT,
  order_status TEXT,
  order_purchase_timestamp TEXT,
  order_delivered_timestamp TEXT
);

CREATE TABLE order_items (
  order_id TEXT,
  order_item_id INTEGER,
  product_id TEXT,
  price REAL,
  freight_value REAL
);

CREATE TABLE payments (
  order_id TEXT,
  payment_sequential INTEGER,
  payment_type TEXT,
  payment_installments INTEGER,
  payment_value REAL
);

CREATE TABLE reviews (
  review_id TEXT,
  order_id TEXT,
  review_score INTEGER,
  review_comment_title TEXT,
  review_comment_message TEXT
);
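
`get_schema` rebuilds each `CREATE TABLE` from `PRAGMA table_info`, which keeps only column names and types. If the tables were ever created with explicit keys, an alternative would be to read the statements SQLite itself stores in `sqlite_master`. The sketch below illustrates this on an in-memory database; `get_stored_ddl` and the demo tables are hypothetical.

```python
import sqlite3

def get_stored_ddl(conn):
    """Return the CREATE TABLE statements as SQLite stored them."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE type='table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    return '\n\n'.join(sql + ';' for (sql,) in rows)

# Demo tables with constraints, to show that they survive in the stored DDL
demo = sqlite3.connect(':memory:')
demo.execute('CREATE TABLE customers (customer_id TEXT PRIMARY KEY)')
demo.execute(
    'CREATE TABLE orders ('
    'order_id TEXT PRIMARY KEY, '
    'customer_id TEXT REFERENCES customers(customer_id))'
)
ddl = get_stored_ddl(demo)
print(ddl)
```

For tables created by `to_sql`, the stored statement is the one pandas generated, so it would still contain no keys; the benefit only appears once constraints are declared.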


In [10]:
# Save the schema
with open(DATA_DIR / 'database' / 'schema.sql', 'w', encoding='utf-8') as f:
    f.write(schema_ddl)
print(f'Schéma sauvegardé: {DATA_DIR / "database" / "schema.sql"}')

Schéma sauvegardé: ../data/database/schema.sql


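## 4. Validate with hand-written queries

In [11]:
def run_query(sql):
    """Run a SQL query, print it, and return the result as a DataFrame."""
    print(f'SQL: {sql}')
    print('-'*60)
    # No display() here: the notebook already renders the returned DataFrame
    # when the call is the last expression of a cell, so the table shows once.
    return pd.read_sql(sql, conn)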
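
`run_query` executes whatever SQL it receives on a read-write connection. Once generated SQL is involved, it may be safer to open the database read-only. The sketch below assumes SQLite's URI syntax with `mode=ro`; `connect_read_only` is a hypothetical helper and the demo uses a temporary file, not `ecommerce.db`.

```python
import sqlite3
import tempfile
from pathlib import Path

def connect_read_only(db_path):
    """Open a SQLite database so that any write attempt raises an error."""
    uri = Path(db_path).resolve().as_uri() + '?mode=ro'
    return sqlite3.connect(uri, uri=True)

with tempfile.TemporaryDirectory() as tmp:
    db_file = Path(tmp) / 'demo.db'

    # Build a small database with a normal read-write connection
    rw = sqlite3.connect(db_file)
    rw.execute('CREATE TABLE t (x INTEGER)')
    rw.execute('INSERT INTO t VALUES (1)')
    rw.commit()
    rw.close()

    # Reads work on the read-only connection, writes are rejected
    ro = connect_read_only(db_file)
    rows = ro.execute('SELECT x FROM t').fetchall()
    try:
        ro.execute('DELETE FROM t')
        write_blocked = False
    except sqlite3.OperationalError:
        write_blocked = True
    ro.close()

print(rows, write_blocked)
```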

In [12]:
# Query 1: simple, total number of orders
print('\n1. SIMPLE: Nombre total de commandes')
run_query("SELECT COUNT(*) as total_orders FROM orders")


1. SIMPLE: Nombre total de commandes
SQL: SELECT COUNT(*) as total_orders FROM orders
------------------------------------------------------------


| | total_orders |
|---|---|
| 0 | 99441 |


In [13]:
# Query 2: simple, orders by status
print('\n2. SIMPLE: Commandes par statut')
run_query("""
SELECT order_status, COUNT(*) as count 
FROM orders 
GROUP BY order_status 
ORDER BY count DESC
""")


2. SIMPLE: Commandes par statut
SQL: 
SELECT order_status, COUNT(*) as count 
FROM orders 
GROUP BY order_status 
ORDER BY count DESC

------------------------------------------------------------


| | order_status | count |
|---|---|---|
| 0 | delivered | 96478 |
| 1 | shipped | 1107 |
| 2 | canceled | 625 |
| 3 | unavailable | 609 |
| 4 | invoiced | 314 |
| 5 | processing | 301 |
| 6 | created | 5 |
| 7 | approved | 2 |


In [14]:
# Query 3: medium, top 5 product categories by revenue
print('\n3. MEDIUM: Top 5 catégories par revenue')
run_query("""
SELECT 
    p.product_category,
    SUM(oi.price) as total_revenue,
    COUNT(*) as items_sold
FROM order_items oi
JOIN products p ON oi.product_id = p.product_id
GROUP BY p.product_category
ORDER BY total_revenue DESC
LIMIT 5
""")


3. MEDIUM: Top 5 catégories par revenue
SQL: 
SELECT 
    p.product_category,
    SUM(oi.price) as total_revenue,
    COUNT(*) as items_sold
FROM order_items oi
JOIN products p ON oi.product_id = p.product_id
GROUP BY p.product_category
ORDER BY total_revenue DESC
LIMIT 5

------------------------------------------------------------


| | product_category | total_revenue | items_sold |
|---|---|---|---|
| 0 | beleza_saude | 1258681.34 | 9670 |
| 1 | relogios_presentes | 1205005.68 | 5991 |
| 2 | cama_mesa_banho | 1036988.68 | 11115 |
| 3 | esporte_lazer | 988048.97 | 8641 |
| 4 | informatica_acessorios | 911954.32 | 7827 |


In [15]:
# Query 4: medium, revenue by payment method
print('\n4. MEDIUM: Revenue par méthode de paiement')
run_query("""
SELECT 
    payment_type,
    SUM(payment_value) as total_revenue,
    COUNT(*) as num_payments,
    ROUND(AVG(payment_value), 2) as avg_payment
FROM payments
GROUP BY payment_type
ORDER BY total_revenue DESC
""")


4. MEDIUM: Revenue par méthode de paiement
SQL: 
SELECT 
    payment_type,
    SUM(payment_value) as total_revenue,
    COUNT(*) as num_payments,
    ROUND(AVG(payment_value), 2) as avg_payment
FROM payments
GROUP BY payment_type
ORDER BY total_revenue DESC

------------------------------------------------------------


| | payment_type | total_revenue | num_payments | avg_payment |
|---|---|---|---|---|
| 0 | credit_card | 12542084.19 | 76795 | 163.32 |
| 1 | boleto | 2869361.27 | 19784 | 145.03 |
| 2 | voucher | 379436.87 | 5775 | 65.7 |
| 3 | debit_card | 217989.79 | 1529 | 142.57 |
| 4 | not_defined | 0.0 | 3 | 0.0 |


In [16]:
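# Query 5: complex, top customers by number of orders with their average review score
# Note: an empty result is expected on the real Olist data if every order carries
# its own customer_id (customers and orders both have 99441 rows).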
print('\n5. COMPLEXE: Top 10 clients par commandes avec note moyenne')
run_query("""
SELECT 
    c.customer_id,
    c.customer_city,
    COUNT(DISTINCT o.order_id) as num_orders,
    ROUND(AVG(r.review_score), 2) as avg_review_score
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
LEFT JOIN reviews r ON o.order_id = r.order_id
GROUP BY c.customer_id, c.customer_city
HAVING COUNT(DISTINCT o.order_id) > 1
ORDER BY num_orders DESC
LIMIT 10
""")


5. COMPLEXE: Top 10 clients par commandes avec note moyenne
SQL: 
SELECT 
    c.customer_id,
    c.customer_city,
    COUNT(DISTINCT o.order_id) as num_orders,
    ROUND(AVG(r.review_score), 2) as avg_review_score
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
LEFT JOIN reviews r ON o.order_id = r.order_id
GROUP BY c.customer_id, c.customer_city
HAVING COUNT(DISTINCT o.order_id) > 1
ORDER BY num_orders DESC
LIMIT 10

------------------------------------------------------------


| | customer_id | customer_city | num_orders | avg_review_score |
|---|---|---|---|---|
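
The query returns no rows, which is consistent with `customers` and `orders` having the same row count (99441): each order appears to carry its own `customer_id`. The raw Olist customers file also contains a `customer_unique_id` column, which is dropped at load time. Assuming that column identifies the same buyer across orders, grouping on it is what would reveal repeat customers. The toy in-memory example below illustrates the difference; its tables and values are invented for the demonstration.

```python
import sqlite3

demo = sqlite3.connect(':memory:')
demo.executescript("""
CREATE TABLE customers (customer_id TEXT, customer_unique_id TEXT);
CREATE TABLE orders (order_id TEXT, customer_id TEXT);
-- One person (u1) places two orders and gets a new customer_id each time
INSERT INTO customers VALUES ('c1', 'u1'), ('c2', 'u1'), ('c3', 'u2');
INSERT INTO orders VALUES ('o1', 'c1'), ('o2', 'c2'), ('o3', 'c3');
""")

# Grouping by the per-order key never finds more than one order
by_customer_id = demo.execute("""
    SELECT c.customer_id, COUNT(DISTINCT o.order_id) AS num_orders
    FROM customers c JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id
    HAVING COUNT(DISTINCT o.order_id) > 1
""").fetchall()

# Grouping by the person-level key finds the repeat buyer
by_unique_id = demo.execute("""
    SELECT c.customer_unique_id, COUNT(DISTINCT o.order_id) AS num_orders
    FROM customers c JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_unique_id
    HAVING COUNT(DISTINCT o.order_id) > 1
""").fetchall()

print(by_customer_id)  # []
print(by_unique_id)    # [('u1', 2)]
```

Applying this to the notebook would mean keeping `customer_unique_id` when selecting columns in `load_olist_data`.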


In [17]:
# Query 6: complex, monthly sales analysis
# Note: AVG(oi.price) averages over items, so avg_order_value is really the mean item price.
print('\n6. COMPLEXE: Analyse mensuelle des ventes')
run_query("""
SELECT 
    strftime('%Y-%m', o.order_purchase_timestamp) as month,
    COUNT(DISTINCT o.order_id) as num_orders,
    SUM(oi.price) as total_revenue,
    ROUND(AVG(oi.price), 2) as avg_order_value
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
WHERE o.order_status = 'delivered'
GROUP BY strftime('%Y-%m', o.order_purchase_timestamp)
ORDER BY month DESC
LIMIT 12
""")


6. COMPLEXE: Analyse mensuelle des ventes
SQL: 
SELECT 
    strftime('%Y-%m', o.order_purchase_timestamp) as month,
    COUNT(DISTINCT o.order_id) as num_orders,
    SUM(oi.price) as total_revenue,
    ROUND(AVG(oi.price), 2) as avg_order_value
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
WHERE o.order_status = 'delivered'
GROUP BY strftime('%Y-%m', o.order_purchase_timestamp)
ORDER BY month DESC
LIMIT 12

------------------------------------------------------------


| | month | num_orders | total_revenue | avg_order_value |
|---|---|---|---|---|
| 0 | 2018-08 | 6351 | 838576.64 | 117.41 |
| 1 | 2018-07 | 6159 | 867953.46 | 124.65 |
| 2 | 2018-06 | 6099 | 856077.86 | 122.12 |
| 3 | 2018-05 | 6749 | 977544.69 | 125.17 |
| 4 | 2018-04 | 6798 | 973534.09 | 124.38 |
| 5 | 2018-03 | 7003 | 953356.25 | 118.92 |
| 6 | 2018-02 | 6555 | 826437.13 | 109.93 |
| 7 | 2018-01 | 7069 | 924645.0 | 115.05 |
| 8 | 2017-12 | 5513 | 726033.19 | 117.35 |
| 9 | 2017-11 | 7289 | 987765.37 | 116.55 |
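
The `avg_order_value` column comes from `AVG(oi.price)`, which averages over order items rather than over orders, so an order with several items counts several times. A minimal sketch of the distinction on toy data follows; the table contents are invented, and the per-order subquery is one possible formulation, not the notebook's own.

```python
import sqlite3

demo = sqlite3.connect(':memory:')
demo.executescript("""
CREATE TABLE order_items (order_id TEXT, price REAL);
-- Order o1 has two items (10 + 30), order o2 has one item (20)
INSERT INTO order_items VALUES ('o1', 10.0), ('o1', 30.0), ('o2', 20.0);
""")

# Mean over items: (10 + 30 + 20) / 3
avg_item_price = demo.execute(
    'SELECT AVG(price) FROM order_items').fetchone()[0]

# Mean over orders: first sum each order, then average the totals: (40 + 20) / 2
avg_order_value = demo.execute("""
    SELECT AVG(order_total)
    FROM (SELECT order_id, SUM(price) AS order_total
          FROM order_items
          GROUP BY order_id)
""").fetchone()[0]

print(avg_item_price, avg_order_value)  # 20.0 30.0
```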


## 5. Create the test set (questions → SQL)

In [19]:
# Business questions for the test set
TEST_QUESTIONS = [
    # Simple (SELECT, WHERE, COUNT)
    {
        'question': 'Combien de commandes au total ?',
        'sql': 'SELECT COUNT(*) as total FROM orders',
        'difficulty': 'simple'
    },
    {
        'question': 'Combien de clients différents ?',
        'sql': 'SELECT COUNT(DISTINCT customer_id) as total FROM customers',
        'difficulty': 'simple'
    },
    {
        'question': 'Combien de commandes livrées ?',
        'sql': "SELECT COUNT(*) as total FROM orders WHERE order_status = 'delivered'",
        'difficulty': 'simple'
    },
    {
        'question': 'Quel est le nombre de produits dans le catalogue ?',
        'sql': 'SELECT COUNT(*) as total FROM products',
        'difficulty': 'simple'
    },
    
    # Medium (GROUP BY, JOIN, ORDER BY)
    {
        'question': 'Quelles sont les 5 villes avec le plus de clients ?',
        'sql': '''SELECT customer_city, COUNT(*) as num_customers 
                  FROM customers 
                  GROUP BY customer_city 
                  ORDER BY num_customers DESC 
                  LIMIT 5''',
        'difficulty': 'medium'
    },
    {
        'question': 'Quel est le revenue total par catégorie de produit ?',
        'sql': '''SELECT p.product_category, SUM(oi.price) as revenue
                  FROM order_items oi
                  JOIN products p ON oi.product_id = p.product_id
                  GROUP BY p.product_category
                  ORDER BY revenue DESC''',
        'difficulty': 'medium'
    },
    {
        'question': 'Quelle est la note moyenne par catégorie de produit ?',
        'sql': '''SELECT p.product_category, ROUND(AVG(r.review_score), 2) as avg_score
                  FROM reviews r
                  JOIN orders o ON r.order_id = o.order_id
                  JOIN order_items oi ON o.order_id = oi.order_id
                  JOIN products p ON oi.product_id = p.product_id
                  GROUP BY p.product_category
                  ORDER BY avg_score DESC''',
        'difficulty': 'medium'
    },
    {
        'question': 'Combien de commandes par méthode de paiement ?',
        'sql': '''SELECT payment_type, COUNT(DISTINCT order_id) as num_orders
                  FROM payments
                  GROUP BY payment_type
                  ORDER BY num_orders DESC''',
        'difficulty': 'medium'
    },
    
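    # Complex (subqueries, HAVING, multiple JOINs)
    # Note: if every order carries its own customer_id, as the empty result of
    # query 5 suggests, this reference query is expected to return no rows.
    {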
        'question': 'Quels clients ont passé plus de 2 commandes ?',
        'sql': '''SELECT c.customer_id, c.customer_city, COUNT(o.order_id) as num_orders
                  FROM customers c
                  JOIN orders o ON c.customer_id = o.customer_id
                  GROUP BY c.customer_id, c.customer_city
                  HAVING COUNT(o.order_id) > 2
                  ORDER BY num_orders DESC''',
        'difficulty': 'complex'
    },
    # Note: AVG(oi.price) is the mean item price per city, used here as a proxy for basket size.
    {
        'question': 'Quel est le panier moyen par ville ?',
        'sql': '''SELECT c.customer_city, ROUND(AVG(oi.price), 2) as avg_basket
                  FROM customers c
                  JOIN orders o ON c.customer_id = o.customer_id
                  JOIN order_items oi ON o.order_id = oi.order_id
                  GROUP BY c.customer_city
                  ORDER BY avg_basket DESC
                  LIMIT 10''',
        'difficulty': 'complex'
    },
    {
        'question': 'Quels produits ont une note moyenne inférieure à 3 ?',
        'sql': '''SELECT p.product_id, p.product_category, ROUND(AVG(r.review_score), 2) as avg_score
                  FROM products p
                  JOIN order_items oi ON p.product_id = oi.product_id
                  JOIN reviews r ON oi.order_id = r.order_id
                  GROUP BY p.product_id, p.product_category
                  HAVING AVG(r.review_score) < 3
                  ORDER BY avg_score ASC''',
        'difficulty': 'complex'
    },
    {
        'question': 'Quel est le délai moyen de livraison par état ?',
        'sql': '''SELECT c.customer_state,
                         ROUND(AVG(julianday(o.order_delivered_timestamp) - 
                               julianday(o.order_purchase_timestamp)), 1) as avg_delivery_days
                  FROM customers c
                  JOIN orders o ON c.customer_id = o.customer_id
                  WHERE o.order_delivered_timestamp IS NOT NULL
                  GROUP BY c.customer_state
                  ORDER BY avg_delivery_days ASC''',
        'difficulty': 'complex'
    }
]

print(f'Test set créé: {len(TEST_QUESTIONS)} questions')
print(f'  Simple: {len([q for q in TEST_QUESTIONS if q["difficulty"] == "simple"])}')
print(f'  Medium: {len([q for q in TEST_QUESTIONS if q["difficulty"] == "medium"])}')
print(f'  Complex: {len([q for q in TEST_QUESTIONS if q["difficulty"] == "complex"])}')

Test set créé: 12 questions
  Simple: 4
  Medium: 4
  Complex: 4


In [20]:
# Check that every query in the test set runs
print('Validation des requêtes du test set...')
print('='*60)

n_failed = 0
for q in TEST_QUESTIONS:
    try:
        pd.read_sql(q['sql'], conn)
        print(f'✓ [{q["difficulty"]}] {q["question"][:50]}...')
    except Exception as e:
        n_failed += 1
        print(f'✗ [{q["difficulty"]}] {q["question"][:50]}...')
        print(f'  Error: {e}')

# Only claim success when nothing failed
if n_failed == 0:
    print('\n✓ Toutes les requêtes sont valides!')
else:
    print(f'\n✗ {n_failed} queries failed.')

Validation des requêtes du test set...
✓ [simple] Combien de commandes au total ?...
✓ [simple] Combien de clients différents ?...
✓ [simple] Combien de commandes livrées ?...
✓ [simple] Quel est le nombre de produits dans le catalogue ?...
✓ [medium] Quelles sont les 5 villes avec le plus de clients ...
✓ [medium] Quel est le revenue total par catégorie de produit...
✓ [medium] Quelle est la note moyenne par catégorie de produi...
✓ [medium] Combien de commandes par méthode de paiement ?...
✓ [complex] Quels clients ont passé plus de 2 commandes ?...
✓ [complex] Quel est le panier moyen par ville ?...
✓ [complex] Quels produits ont une note moyenne inférieure à 3...
✓ [complex] Quel est le délai moyen de livraison par état ?...

✓ Toutes les requêtes sont valides!
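
The validation above only confirms that each reference query runs. To score generated SQL against this test set later, one common approach is to compare result rows rather than SQL text. The sketch below is an assumption about how that could look; `same_result` and the demo table are hypothetical, and the comparison ignores column names and row order.

```python
import sqlite3

def same_result(conn, gold_sql, predicted_sql):
    """Return True when both queries yield the same rows, ignoring row order."""
    try:
        gold = conn.execute(gold_sql).fetchall()
        pred = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # A query that does not run counts as a mismatch
        return False
    # key=repr allows sorting rows that mix None with other values
    return sorted(gold, key=repr) == sorted(pred, key=repr)

demo = sqlite3.connect(':memory:')
demo.executescript("""
CREATE TABLE orders (order_id TEXT, order_status TEXT);
INSERT INTO orders VALUES ('o1', 'delivered'), ('o2', 'shipped');
""")

gold = 'SELECT COUNT(*) as total FROM orders'
match_equivalent = same_result(demo, gold, 'SELECT COUNT(order_id) FROM orders')
match_wrong = same_result(
    demo, gold, "SELECT COUNT(*) FROM orders WHERE order_status = 'delivered'")
match_invalid = same_result(demo, gold, 'SELECT COUNT(*) FROM missing_table')

print(match_equivalent, match_wrong, match_invalid)
```

Ignoring row order is a simplification: for questions whose answer depends on `ORDER BY`, such as a top-5 ranking, the sorted comparison would need to be replaced by a direct list comparison.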


In [21]:
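# Save the test set
import json

results_dir = DATA_DIR / 'results'
results_dir.mkdir(parents=True, exist_ok=True)  # the folder may not exist yet

with open(results_dir / 'test_questions.json', 'w', encoding='utf-8') as f:
    json.dump(TEST_QUESTIONS, f, indent=2, ensure_ascii=False)  # keep accents readable

print(f'Test set sauvegardé: {results_dir / "test_questions.json"}')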

Test set sauvegardé: ../data/results/test_questions.json


## Summary

### What was done:
1. Olist dataset loaded (or synthetic data generated)
2. SQLite database created with 6 tables
3. DDL schema exported
4. Hand-written queries validated (simple, medium, complex)
5. Test set created (12 questions)

### Next step:
**Notebook 02**: Test a basic LLM (zero-shot) to establish the baseline

In [22]:
# Close the connection
conn.close()
print('Connexion fermée.')

Connexion fermée.
