## ETL processing of the Olist datasets

#### 1. Cleaning: operations common to all source files

This part handles the cleaning steps that can apply to every table, without any processing specific to a given table.

In [2]:
# Import dependencies and load the datasets into memory

import os
from pathlib import Path
import pandas as pd
from scripts.utils import clean_data

data = {}

rt = Path('./data')
file_paths = os.listdir(rt)
csv_file_paths = [f for f in file_paths if f.endswith('.csv')]
for csv_path in csv_file_paths:
    print(f'Loading {csv_path}...')
    df = pd.read_csv(rt.joinpath(csv_path))
    df = clean_data(df) # Apply a generic cleaning helper
    # Derive a short table name, e.g. 'olist_orders_dataset.csv' -> 'orders'
    source = csv_path.replace('olist_', '').replace('.csv', '').replace('_dataset', '')
    data[source] = df

Loading olist_orders_dataset.csv...
Loading olist_products_dataset.csv...
Loading olist_order_items_dataset.csv...
Loading product_category_name_translation.csv...
Loading olist_sellers_dataset.csv...
Loading olist_geolocation_dataset.csv...
Loading olist_order_reviews_dataset.csv...
Loading olist_order_payments_dataset.csv...
Loading olist_customers_dataset.csv...
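
The `clean_data` helper is imported from `scripts.utils` and is not shown in this notebook. To give a rough idea of what a generic cleaner of this kind might do, here is a minimal sketch. The function name `clean_data_sketch`, the name-based rule for spotting date columns, and the timestamp format are all assumptions made for illustration, not the actual implementation.

```python
import pandas as pd


def clean_data_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical generic cleaner (NOT the real scripts.utils.clean_data)."""
    df = df.copy()

    # Trim surrounding whitespace on every string value, leave other values alone
    for col in df.columns:
        df[col] = df[col].map(lambda v: v.strip() if isinstance(v, str) else v)

    # Parse columns that look like dates (name-based heuristic, an assumption).
    # An explicit format keeps parsing consistent; unparseable values become NaT.
    for col in df.columns:
        if col.endswith(("_date", "_timestamp")):
            df[col] = pd.to_datetime(
                df[col], format="%Y-%m-%d %H:%M:%S", errors="coerce"
            )

    # Drop rows that are exact duplicates of an earlier row
    return df.drop_duplicates().reset_index(drop=True)


# Toy input, invented for this example
raw = pd.DataFrame({
    "order_id": [" a ", "a", "b"],
    "order_purchase_timestamp": [
        "2017-10-02 10:56:33",
        "2017-10-02 10:56:33",
        "not a date",
    ],
})
cleaned = clean_data_sketch(raw)
```

Keeping this kind of helper free of table-specific logic is what lets the loading loop call it on every CSV file the same way.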


#### 2. Cleaning: table-specific cases

In this part of the notebook I apply more specific cleaning steps to particular tables:
- customers: removing duplicates and an unused column => EDIT: not a good idea, so I commented it out
- geolocation and customers: normalizing city names so that any KPIs relying on city names stay consistent

In [None]:
# Deduplicate the customers table (intentionally commented out)
# data['customers'].drop_duplicates(subset=['customer_unique_id'], keep='first')
# data['customers'] = data['customers'].drop(columns=['customer_unique_id'])

In [3]:
from scripts.utils import norm

# Normalize city names
data['geolocation']['geolocation_city'] = data['geolocation']['geolocation_city'].apply(norm)
data['customers']['customer_city'] = data['customers']['customer_city'].apply(norm)
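
The `norm` helper is also defined in `scripts.utils` and is not shown here. A plausible version, assuming it lowercases names, strips accents, and collapses whitespace, could look like the sketch below. The name `norm_sketch` and the exact rules are assumptions.

```python
import unicodedata


def norm_sketch(name):
    """Hypothetical city-name normalizer (NOT the real scripts.utils.norm)."""
    # Leave missing or non-text values untouched
    if not isinstance(name, str):
        return name
    # Decompose accented characters, then drop the combining marks
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Lowercase and collapse any run of whitespace into a single space
    return " ".join(without_accents.lower().split())


normalized = norm_sketch("  São   Paulo ")
```

With a rule like this, spelling variants of the same city in `geolocation` and `customers` end up as the same string, which is what makes grouping or joining on city names reliable.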

#### 3. Transformations

The main transformation I came up with for this case study was to add the translated category names directly to the products table.

The goal is to simplify queries that filter or group by category: it avoids later joins just to fetch the translations.

At first I even considered doing this so I could drop the translation table entirely, but I decided to keep it in case a Brazilian user needs to query the database: the joins then remain possible.

In [4]:
# Transform the products table

# Merge the dataframes to add the English translation and avoid later joins
data['products'] = data['products'].merge(
    data['product_category_name_translation'],
    how='left', # Join type: left, to keep every product even without a translation
    left_on='product_category_name', # Join key in the products table
    right_on='product_category_name' # Join key in the translation table
)

# Rename the new column to avoid conflicts
data['products'].rename(columns={'product_category_name_english': 'product_category_name_en'},
                inplace=True)

# Final result
print(data['products'][['product_category_name', 'product_category_name_en']].head())

   product_category_name product_category_name_en
0             perfumaria                perfumery
1                  artes                      art
2          esporte_lazer           sports_leisure
3                  bebes                     baby
4  utilidades_domesticas               housewares
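
A left merge like this is only safe if each category appears once in the translation table; otherwise product rows would be duplicated. As a sketch, pandas can check this with `validate='many_to_one'`, and `indicator=True` shows which products found no translation. The data below is a toy example, and `categoria_desconhecida` is an invented category.

```python
import pandas as pd

# Toy stand-ins for the products and translation tables
products = pd.DataFrame({
    "product_id": ["p1", "p2", "p3"],
    "product_category_name": ["perfumaria", "artes", "categoria_desconhecida"],
})
translations = pd.DataFrame({
    "product_category_name": ["perfumaria", "artes"],
    "product_category_name_english": ["perfumery", "art"],
})

merged = products.merge(
    translations,
    how="left",
    on="product_category_name",
    validate="many_to_one",  # raises MergeError if a category is translated twice
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)

# Categories that exist in products but have no translation
untranslated = merged.loc[
    merged["_merge"] == "left_only", "product_category_name"
].tolist()
merged = merged.drop(columns="_merge")
```

The row count of `merged` matching the row count of `products` is the property that matters before overwriting the products table.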


#### 4. Loading into the database


For loading I chose sqlite3 so that I could create the tables manually with SQL scripts. This way the tables are created with all the required constraints from the start, and it is simpler than using the SQLAlchemy syntax, which adds an extra abstraction layer on top of SQL.

In [5]:
import sqlite3

DB_PATH = "./olist.db"
with sqlite3.connect(DB_PATH) as conn:
    cur = conn.cursor()

    # Open and run the SQL script that creates the database schema
    with open('./scripts/schema.sql', 'r') as f:
        schema_sql = f.read()
        cur.executescript(schema_sql)

In [6]:
# Load the cleaned/transformed data into the SQLite database
for table_name, df in data.items():
    df.to_sql(table_name, con=conn, if_exists='replace', index=False)
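
One point worth checking when combining a hand-written schema with `DataFrame.to_sql`: `if_exists='replace'` drops the existing table and lets pandas recreate it, so constraints declared in the schema script are lost, whereas `if_exists='append'` inserts into the table as it was created. The sketch below illustrates the difference on an in-memory database. The `sellers` table definition is a toy schema, since the contents of `schema.sql` are not shown here.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")

# Toy schema standing in for scripts/schema.sql (an assumption)
conn.execute(
    "CREATE TABLE sellers (seller_id TEXT PRIMARY KEY, seller_city TEXT NOT NULL)"
)
df = pd.DataFrame({
    "seller_id": ["s1", "s2"],
    "seller_city": ["sao paulo", "curitiba"],
})


def table_ddl(conn, name):
    """Return the CREATE TABLE statement SQLite stores for a table."""
    row = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?",
        (name,),
    ).fetchone()
    return row[0]


# 'append' keeps the hand-written table, constraints included
df.to_sql("sellers", con=conn, if_exists="append", index=False)
ddl_after_append = table_ddl(conn, "sellers")

# 'replace' drops the table and recreates it from the DataFrame dtypes only
df.to_sql("sellers", con=conn, if_exists="replace", index=False)
ddl_after_replace = table_ddl(conn, "sellers")

conn.close()
```

If the schema and the dataframes have matching column names, switching to `append` is one way to keep the constraints. Whether that holds for this project depends on `schema.sql`.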

### 5. Developing performance indicators

#### 💰 Sales

In [10]:
from scripts.utils import benchmark_query

with open('./scripts/pki/sales/daily.sql', 'r') as f:
    sql_script = f.read()
    result, message = benchmark_query(sql_script)
    
    print('Statistics:', message)
    print('Results (max. 20):')
    for row in result:
        print(dict(row))

# Unoptimized performance: Fetched 633 rows in 0.905s

Statistics: Fetched 633 rows in 0.393s
Results (max. 20):
{'jour': '2016-09-04', 'ca_jour': 136.23}
{'jour': '2016-09-05', 'ca_jour': 75.06}
{'jour': '2016-09-13', 'ca_jour': 40.95}
{'jour': '2016-10-02', 'ca_jour': 109.34}
{'jour': '2016-10-03', 'ca_jour': 595.14}
{'jour': '2016-10-04', 'ca_jour': 11914.65}
{'jour': '2016-10-05', 'ca_jour': 10212.22}
{'jour': '2016-10-06', 'ca_jour': 9398.61}
{'jour': '2016-10-07', 'ca_jour': 8255.19}
{'jour': '2016-10-08', 'ca_jour': 9716.13}
{'jour': '2016-10-09', 'ca_jour': 4152.11}
{'jour': '2016-10-10', 'ca_jour': 4675.1}
{'jour': '2016-10-22', 'ca_jour': 61.99}
{'jour': '2016-12-23', 'ca_jour': 19.62}
{'jour': '2017-01-05', 'ca_jour': 707.27}
{'jour': '2017-01-06', 'ca_jour': 988.28}
{'jour': '2017-01-07', 'ca_jour': 1447.99}
{'jour': '2017-01-08', 'ca_jour': 809.63}
{'jour': '2017-01-09', 'ca_jour': 774.51}
{'jour': '2017-01-10', 'ca_jour': 1571.17}
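
`benchmark_query` comes from `scripts.utils` and is not shown in the notebook. Judging from how it is used (it returns rows convertible with `dict(row)` plus a "Fetched N rows in X.XXXs" message, and the display is capped at 20 rows), a helper along these lines would behave similarly. The name `benchmark_query_sketch`, its parameters, and the single-statement assumption are hypothetical.

```python
import sqlite3
import time


def benchmark_query_sketch(sql, db_path="./olist.db", max_rows=20):
    """Hypothetical timing helper (NOT the real scripts.utils.benchmark_query).

    Assumes `sql` holds a single SELECT statement.
    """
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # rows can then be turned into dicts
    try:
        start = time.perf_counter()
        rows = conn.execute(sql).fetchall()  # fetch everything so timing is complete
        elapsed = time.perf_counter() - start
    finally:
        conn.close()
    message = f"Fetched {len(rows)} rows in {elapsed:.3f}s"
    return rows[:max_rows], message


# Usage on an in-memory database, with a query that needs no tables
result, message = benchmark_query_sketch(
    "SELECT 1 AS x UNION ALL SELECT 2 ORDER BY x", db_path=":memory:"
)
```

Timing `fetchall()` rather than just `execute()` matters, because SQLite produces rows lazily and most of the work happens while fetching.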


In [11]:
from scripts.utils import benchmark_query

with open('./scripts/pki/sales/daily.sql', 'r') as f:
    sql_script = f.read()
    result, message = benchmark_query(sql_script)
    
    print('Statistics:', message)
    print('Results (max. 20):')
    for row in result:
        print(dict(row))

# Unoptimized performance: Fetched 25 rows in 0.716s

Statistics: Fetched 633 rows in 0.418s
Results (max. 20):
{'jour': '2016-09-04', 'ca_jour': 136.23}
{'jour': '2016-09-05', 'ca_jour': 75.06}
{'jour': '2016-09-13', 'ca_jour': 40.95}
{'jour': '2016-10-02', 'ca_jour': 109.34}
{'jour': '2016-10-03', 'ca_jour': 595.14}
{'jour': '2016-10-04', 'ca_jour': 11914.65}
{'jour': '2016-10-05', 'ca_jour': 10212.22}
{'jour': '2016-10-06', 'ca_jour': 9398.61}
{'jour': '2016-10-07', 'ca_jour': 8255.19}
{'jour': '2016-10-08', 'ca_jour': 9716.13}
{'jour': '2016-10-09', 'ca_jour': 4152.11}
{'jour': '2016-10-10', 'ca_jour': 4675.1}
{'jour': '2016-10-22', 'ca_jour': 61.99}
{'jour': '2016-12-23', 'ca_jour': 19.62}
{'jour': '2017-01-05', 'ca_jour': 707.27}
{'jour': '2017-01-06', 'ca_jour': 988.28}
{'jour': '2017-01-07', 'ca_jour': 1447.99}
{'jour': '2017-01-08', 'ca_jour': 809.63}
{'jour': '2017-01-09', 'ca_jour': 774.51}
{'jour': '2017-01-10', 'ca_jour': 1571.17}


In [12]:
from scripts.utils import benchmark_query

with open('./scripts/pki/sales/yearly.sql', 'r') as f:
    sql_script = f.read()
    result, message = benchmark_query(sql_script)
    
    print('Statistics:', message)
    print('Results (max. 20):')
    for row in result:
        print(dict(row))

# Unoptimized performance: Fetched 3 rows in 0.791s

Statistics: Fetched 3 rows in 0.411s
Results (max. 20):
{'annee': '2016', 'ca_annee': 59362.34}
{'annee': '2017', 'ca_annee': 7249746.73}
{'annee': '2018', 'ca_annee': 8699763.05}
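
Each cell records an earlier, slower timing in a comment, but the notebook does not say what changed between the two runs. Adding indexes on join and filter columns is one common way to get this kind of speed-up in SQLite. The sketch below is only an illustration of that idea, with a toy `order_items` table and a hypothetical index name, and uses `EXPLAIN QUERY PLAN` to check whether the index is actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Toy table, invented for this example
conn.execute("CREATE TABLE order_items (order_id TEXT, price REAL)")
conn.executemany(
    "INSERT INTO order_items VALUES (?, ?)",
    [("o1", 10.0), ("o2", 25.5), ("o1", 4.9)],
)


def query_plan(conn, sql, params=()):
    """Return the human-readable steps of SQLite's plan for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]  # the 4th column holds the description


sql = "SELECT price FROM order_items WHERE order_id = ?"

# Without an index, SQLite has to scan the whole table
plan_before = query_plan(conn, sql, ("o1",))

# Hypothetical index on the lookup column
conn.execute("CREATE INDEX idx_order_items_order_id ON order_items (order_id)")
plan_after = query_plan(conn, sql, ("o1",))

conn.close()
```

Comparing the plan before and after is a quicker check than timing alone, since timings on a small database vary a lot from run to run.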


In [13]:
from scripts.utils import benchmark_query

with open('./scripts/pki/sales/previous_year_comparison.sql', 'r') as f:
    sql_script = f.read()
    result, message = benchmark_query(sql_script)
    
    print('Statistics:', message)
    print('Results (max. 20):')
    for row in result:
        print(dict(row))

# Unoptimized performance: Fetched 633 rows in 1.742s

Statistics: Fetched 633 rows in 0.830s
Results (max. 20):
{'jour': '2016-09-04', 'ca_courant': 136.23, 'ca_n_1': 28719.24, 'variation_abs': -28583.010000000002, 'variation_pct': -99.53}
{'jour': '2016-09-05', 'ca_courant': 75.06, 'ca_n_1': 24954.68, 'variation_abs': -24879.62, 'variation_pct': -99.7}
{'jour': '2016-09-13', 'ca_courant': 40.95, 'ca_n_1': 32054.37, 'variation_abs': -32013.42, 'variation_pct': -99.87}
{'jour': '2016-10-02', 'ca_courant': 109.34, 'ca_n_1': 25565.11, 'variation_abs': -25455.77, 'variation_pct': -99.57}
{'jour': '2016-10-03', 'ca_courant': 595.14, 'ca_n_1': 30389.12, 'variation_abs': -29793.98, 'variation_pct': -98.04}
{'jour': '2016-10-04', 'ca_courant': 11914.65, 'ca_n_1': 24787.58, 'variation_abs': -12872.930000000002, 'variation_pct': -51.93}
{'jour': '2016-10-05', 'ca_courant': 10212.22, 'ca_n_1': 29129.83, 'variation_abs': -18917.61, 'variation_pct': -64.94}
{'jour': '2016-10-06', 'ca_courant': 9398.61, 'ca_n_1': 21890.58, 'variation_abs': -12491.9
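
The SQL behind this comparison is not shown, but the figures in the output are consistent with `variation_abs = ca_courant - ca_n_1` and `variation_pct = variation_abs / ca_n_1 * 100`, rounded to two decimals. A small sketch that reproduces the first two rows (the function name `year_over_year` is hypothetical):

```python
def year_over_year(ca_courant, ca_n_1):
    """Absolute and percentage variation against the comparison value."""
    variation_abs = ca_courant - ca_n_1
    # Guard against a missing or zero comparison value
    if not ca_n_1:
        return variation_abs, None
    variation_pct = round(variation_abs / ca_n_1 * 100, 2)
    return variation_abs, variation_pct


# Values taken from the first two rows of the output above
abs_1, pct_1 = year_over_year(136.23, 28719.24)
abs_2, pct_2 = year_over_year(75.06, 24954.68)
```

The long decimals in `variation_abs` (for example `-28583.010000000002`) are ordinary floating-point artifacts; rounding that column in the query would tidy the display.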

In [14]:
from scripts.utils import benchmark_query

with open('./scripts/pki/sales/top10.sql', 'r') as f:
    sql_script = f.read()
    result, message = benchmark_query(sql_script)
    
    print('Statistics:', message)
    print('Results (max. 20):')
    for row in result:
        print(dict(row))

# Unoptimized performance: Fetched 10 rows in 1.228s

Statistics: Fetched 10 rows in 0.681s
Results (max. 20):
{'product_id': '5769ef0a239114ac3a854af00df129e4', 'product_category_name_en': 'fixed_telephony', 'product_name_length': 39.0, 'ca_produit': 109312.64, 'rang': 1}
{'product_id': 'bb50f2e236e5eea0100680137654686c', 'product_category_name_en': 'health_beauty', 'product_name_length': 60.0, 'ca_produit': 81887.42, 'rang': 2}
{'product_id': '422879e10f46682990de24d770e7f83d', 'product_category_name_en': 'garden_tools', 'product_name_length': 56.0, 'ca_produit': 79512.22, 'rang': 3}
{'product_id': 'd1c427060a0f73f6b889a5c7c61f2ac4', 'product_category_name_en': 'computers_accessories', 'product_name_length': 59.0, 'ca_produit': 70557.9, 'rang': 4}
{'product_id': '6cdd53843498f92890544667809f1595', 'product_category_name_en': 'health_beauty', 'product_name_length': 51.0, 'ca_produit': 64825.67, 'rang': 5}
{'product_id': 'd5991653e037ccb7af6ed7d94246b249', 'product_category_name_en': 'computers_accessories', 'product_name_length': 42.

#### 👥 Customers

In [15]:
from scripts.utils import benchmark_query

with open('./scripts/pki/customers/new_vs_recurring_customers.sql', 'r') as f:
    sql_script = f.read()
    result, message = benchmark_query(sql_script)
    
    print('Statistics:', message)
    print('Results (max. 20):')
    for row in result:
        print(dict(row))

# Unoptimized performance: Fetched 1 rows in 0.141s

Statistics: Fetched 1 rows in 0.067s
Results (max. 20):
{'new_customers': 99441, 'returning_customers': 0}
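
A result of 0 returning customers is worth a second look. The query itself is not shown, but as far as the Olist dataset documentation describes it, every order gets its own `customer_id`, while `customer_unique_id` identifies the same person across orders. Counting repeat buyers per `customer_id` would therefore always give zero. The sketch below shows the alternative grouping in pandas, on toy rows invented for illustration.

```python
import pandas as pd

# Toy data: u1 ordered twice, under two different customer_id values
customers = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3", "c4"],
    "customer_unique_id": ["u1", "u1", "u2", "u3"],
})
orders = pd.DataFrame({
    "order_id": ["o1", "o2", "o3", "o4"],
    "customer_id": ["c1", "c2", "c3", "c4"],
})

# Count distinct orders per real person rather than per customer_id
orders_per_person = (
    orders.merge(customers, on="customer_id", how="inner")
    .groupby("customer_unique_id")["order_id"]
    .nunique()
)

new_customers = int((orders_per_person == 1).sum())
returning_customers = int((orders_per_person > 1).sum())
```

This is also a reason to keep `customer_unique_id` in the customers table rather than dropping it during cleaning.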


In [16]:
from scripts.utils import benchmark_query

with open('./scripts/pki/customers/average_cart.sql', 'r') as f:
    sql_script = f.read()
    result, message = benchmark_query(sql_script)
    
    print('Statistics:', message)
    print('Results (max. 20):')
    for row in result:
        print(dict(row))

# Unoptimized performance: Fetched 1 rows in 0.135s

Statistics: Fetched 1 rows in 0.060s
Results (max. 20):
{'avg_basket_value': 160.57763809214927}


In [17]:
from scripts.utils import benchmark_query

with open('./scripts/pki/customers/conversion_rate.sql', 'r') as f:
    sql_script = f.read()
    result, message = benchmark_query(sql_script)
    
    print('Statistics:', message)
    print('Results (max. 20):')
    for row in result:
        print(dict(row))

# Unoptimized performance: Fetched 1 rows in 0.027s

Statistics: Fetched 1 rows in 0.012s
Results (max. 20):
{'conversion_rate': 0.9702034372140264}


In [18]:
from scripts.utils import benchmark_query

with open('./scripts/pki/customers/rfmbbq_analysis.sql', 'r') as f:
    sql_script = f.read()
    result, message = benchmark_query(sql_script)
    
    print('Statistics:', message)
    print('Results (max. 20):')
    for row in result:
        print(dict(row))

# Unoptimized performance: Fetched 98666 rows in 1.195s

Statistics: Fetched 98666 rows in 0.534s
Results (max. 20):
{'customer_id': '1617b1357756262bfa56ab541c47bc16', 'recency_days': 3058, 'frequency': 1, 'monetary': 13664.08}
{'customer_id': 'ec5b2ba62e574342386871631fafd3fc', 'recency_days': 2769, 'frequency': 1, 'monetary': 7274.88}
{'customer_id': 'c6e2731c5b391845f6800c97401a43a9', 'recency_days': 3287, 'frequency': 1, 'monetary': 6929.31}
{'customer_id': 'f48d464a0baaea338cb25f816991ab1f', 'recency_days': 2759, 'frequency': 1, 'monetary': 6922.21}
{'customer_id': '3fd6777bbce08a352fddd04e4a7cc8f6', 'recency_days': 3186, 'frequency': 1, 'monetary': 6726.66}
{'customer_id': '05455dfa7cd02f13d132aa7a6a9729c6', 'recency_days': 3002, 'frequency': 1, 'monetary': 6081.54}
{'customer_id': 'df55c14d1476a9a3467f131269c2477f', 'recency_days': 3239, 'frequency': 1, 'monetary': 4950.34}
{'customer_id': 'e0a2412720e9ea4f26c1ac985f6a7358', 'recency_days': 2772, 'frequency': 1, 'monetary': 4809.44}
{'customer_id': '24bbf5fd2f2e1b359ee7de94defc4a
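
For readers who want to see how the three RFM columns relate to the raw orders, here is a minimal pandas sketch on toy data. The SQL script is not shown, so the column names `order_purchase_timestamp` and `payment_value`, and the choice of a fixed reference date, are assumptions. Recency values around 3000 days, as in the output above, would be consistent with measuring against the current date rather than the end of the dataset.

```python
import pandas as pd

# Toy orders, invented for this example
orders = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2"],
    "order_purchase_timestamp": pd.to_datetime(
        ["2018-01-01", "2018-03-01", "2018-02-15"]
    ),
    "payment_value": [100.0, 50.0, 30.0],
})

# Assumption: measure recency against a fixed date, not "today"
reference_date = pd.Timestamp("2018-03-11")

rfm = orders.groupby("customer_id").agg(
    last_purchase=("order_purchase_timestamp", "max"),
    frequency=("order_purchase_timestamp", "count"),  # number of orders
    monetary=("payment_value", "sum"),                # total amount spent
)
# Days elapsed since the customer's most recent order
rfm["recency_days"] = (reference_date - rfm["last_purchase"]).dt.days
rfm = rfm.drop(columns="last_purchase")
```

Using a fixed reference date keeps the recency values stable between runs, which makes the segmentation reproducible.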

#### 📊 Cohorts

In [19]:
from scripts.utils import benchmark_query

with open('./scripts/pki/cohorts/first_month_retention.sql', 'r') as f:
    sql_script = f.read()
    result, message = benchmark_query(sql_script)
    
    print('Statistics:', message)
    print('Results (max. 20):')
    for row in result:
        print(dict(row))

# Unoptimized performance: Fetched 634 rows in 0.409s

Statistics: Fetched 634 rows in 0.424s
Results (max. 20):
{'cohort_date': '2016-09-04', 'cohort_month': 0, 'active_customers': 1}
{'cohort_date': '2016-09-05', 'cohort_month': 0, 'active_customers': 1}
{'cohort_date': '2016-09-13', 'cohort_month': 0, 'active_customers': 1}
{'cohort_date': '2016-09-15', 'cohort_month': 0, 'active_customers': 1}
{'cohort_date': '2016-10-02', 'cohort_month': 0, 'active_customers': 1}
{'cohort_date': '2016-10-03', 'cohort_month': 0, 'active_customers': 8}
{'cohort_date': '2016-10-04', 'cohort_month': 0, 'active_customers': 63}
{'cohort_date': '2016-10-05', 'cohort_month': 0, 'active_customers': 47}
{'cohort_date': '2016-10-06', 'cohort_month': 0, 'active_customers': 51}
{'cohort_date': '2016-10-07', 'cohort_month': 0, 'active_customers': 46}
{'cohort_date': '2016-10-08', 'cohort_month': 0, 'active_customers': 42}
{'cohort_date': '2016-10-09', 'cohort_month': 0, 'active_customers': 26}
{'cohort_date': '2016-10-10', 'cohort_month': 0, 'active_customers':
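
As an illustration of how a retention table of this shape can be built, the sketch below assigns each customer to the month of their first purchase and counts distinct active customers per month offset. It is not the notebook's query: it uses monthly cohorts instead of the daily `cohort_date` seen above, and the column names and data are invented.

```python
import pandas as pd

# Toy purchase history
orders = pd.DataFrame({
    "customer": ["u1", "u1", "u2", "u2", "u3"],
    "purchased_at": pd.to_datetime(
        ["2017-01-05", "2017-02-10", "2017-01-20", "2017-01-25", "2017-02-02"]
    ),
})

# First purchase date of each customer, repeated on each of their rows
first_purchase = orders.groupby("customer")["purchased_at"].transform("min")

# Cohort = month of first purchase
orders["cohort"] = first_purchase.dt.to_period("M")

# Whole months elapsed between the first purchase and this purchase
orders["cohort_month"] = (
    (orders["purchased_at"].dt.year - first_purchase.dt.year) * 12
    + (orders["purchased_at"].dt.month - first_purchase.dt.month)
)

# Distinct customers active in each (cohort, month offset) pair
retention = (
    orders.groupby(["cohort", "cohort_month"])["customer"]
    .nunique()
    .rename("active_customers")
    .reset_index()
)
```

In this toy data the January cohort has two customers at offset 0 and one still active at offset 1, which is the retention signal the table is meant to expose.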

In [20]:
from scripts.utils import benchmark_query

with open('./scripts/pki/cohorts/lifetime_value_per_cohort.sql', 'r') as f:
    sql_script = f.read()
    result, message = benchmark_query(sql_script)
    
    print('Statistics:', message)
    print('Results (max. 20):')
    for row in result:
        print(dict(row))

# Unoptimized performance: Fetched 616 rows in 0.852s

Statistics: Fetched 616 rows in 0.782s
Results (max. 20):
{'cohort_date': '2016-09-04', 'customers_in_cohort': 1, 'total_revenue': 136.23, 'ltv_per_customer': 136.23}
{'cohort_date': '2016-09-05', 'customers_in_cohort': 1, 'total_revenue': 75.06, 'ltv_per_customer': 75.06}
{'cohort_date': '2016-09-15', 'customers_in_cohort': 1, 'total_revenue': 143.46, 'ltv_per_customer': 143.46}
{'cohort_date': '2016-10-02', 'customers_in_cohort': 1, 'total_revenue': 109.34, 'ltv_per_customer': 109.34}
{'cohort_date': '2016-10-03', 'customers_in_cohort': 8, 'total_revenue': 595.14, 'ltv_per_customer': 74.3925}
{'cohort_date': '2016-10-04', 'customers_in_cohort': 60, 'total_revenue': 11295.48, 'ltv_per_customer': 188.25799999999998}
{'cohort_date': '2016-10-05', 'customers_in_cohort': 42, 'total_revenue': 9645.94, 'ltv_per_customer': 229.6652380952381}
{'cohort_date': '2016-10-06', 'customers_in_cohort': 49, 'total_revenue': 9131.23, 'ltv_per_customer': 186.35163265306122}
{'cohort_date': '2016-10-
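
The `ltv_per_customer` column in this output is consistent with a simple ratio, `total_revenue / customers_in_cohort`. A short check using two rows copied from the output above (the SQL itself is not shown, so this is an inference from the figures):

```python
import pandas as pd

# Two cohorts taken from the output above
cohorts = pd.DataFrame({
    "cohort_date": ["2016-10-03", "2016-10-04"],
    "customers_in_cohort": [8, 60],
    "total_revenue": [595.14, 11295.48],
})

# Lifetime value per customer = cohort revenue divided by cohort size
cohorts["ltv_per_customer"] = (
    cohorts["total_revenue"] / cohorts["customers_in_cohort"]
)
```

Values such as `188.25799999999998` in the output are floating-point artifacts of this division; rounding to two decimals in the query would make the table easier to read.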