# 🛒 Olist E-Commerce Intelligence Project

## Notebook 1: Exploratory Data Analysis (EDA)

---

### 📋 Project Summary

In this project we analyze data from Olist, Brazil's largest e-commerce platform.

### 🎯 Business Problems

| Problem | Current State | Target |
|---------|---------------|--------|
| Delivery time prediction | 12.6 days ± 9.5 (mean ± std) | RMSE < 8 days |
| Customer satisfaction | 14.7% low reviews (1-2 stars) | < 8% |
| Repeat purchase | 3% | > 10% |

### 📊 In This Notebook:
1. Data exploration
2. Delivery time analysis
3. Customer behavior
4. Review score analysis
5. Geographic analysis


In [1]:
# Libraries
import pandas as pd
import numpy as np
import plotly.express as px
from sqlalchemy import create_engine, text
import warnings
warnings.filterwarnings('ignore')

# Database connection (SQLite file one directory above the notebook).
# The engine connects lazily: nothing is opened until the first query.
engine = create_engine('sqlite:///../olist.db')
print('✅ Database connection established')

✅ Database connection established


In [2]:
# --- AUTOMATIC DATA LOADING ---
# If the tables are missing from the database, download them from Kaggle and load them automatically.

from sqlalchemy import inspect
import sys
import os

# Make the project root importable (needed for the src module).
# Compare the absolute path, otherwise the check never matches and
# the path is appended again on every re-run of the cell.
project_root = os.path.abspath('..')
if project_root not in sys.path:
    sys.path.append(project_root)

from src.config import DATABASE_URL, DATA_RAW_PATH
from src.ml.ingest import OlistIngestor

inspector = inspect(engine)
if not inspector.has_table('orders'):
    print('⚠️ Tables not found. Starting data load... (this may take 1-2 minutes)')
    # The ingestor expects a database URL string, so pass the notebook
    # engine's URL (converted to str) instead of the URL from the config file.
    ingestor = OlistIngestor(str(engine.url), str(DATA_RAW_PATH))
    ingestor.run()
    print('✅ Data load complete!')
else:
    print('✅ Tables found. You can continue with the analysis.')


✅ Tables found. You can continue with the analysis.
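The cell above follows a check-then-load pattern: ingestion runs only when the `orders` table is missing, so re-running the notebook is harmless. As a minimal sketch of that idea, the snippet below uses the standard-library `sqlite3` module on an in-memory database. The `has_table` helper and the `CREATE TABLE` stand-in for ingestion are illustrative assumptions; this is not how SQLAlchemy's inspector or `OlistIngestor` is implemented.

```python
import sqlite3


def has_table(conn: sqlite3.Connection, name: str) -> bool:
    """Return True if a table with this name exists in the SQLite database."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (name,),
    ).fetchone()
    return row is not None


conn = sqlite3.connect(":memory:")  # throwaway database, not olist.db

before = has_table(conn, "orders")
if not before:
    # Stand-in for the real ingestion step (hypothetical one-column schema).
    conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY)")
after = has_table(conn, "orders")

print(before, after)
```

Running the guarded block a second time would skip the `CREATE TABLE`, which is the property the notebook relies on.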


---
## 📊 Step 1: Data Exploration

The Olist dataset consists of 8 tables. Let's inspect the size of each one.

In [3]:
# Table sizes
tables = ['orders', 'order_items', 'customers', 'sellers',
          'products', 'order_reviews', 'order_payments', 'geolocation']

print('📦 Table Sizes')
print('='*40)
for table in tables:
    try:
        with engine.connect() as conn:
            # Table names come from the fixed list above, so the f-string is safe here
            result = conn.execute(text(f'SELECT COUNT(*) FROM {table}'))
            count = result.scalar()
        print(f'{table}: {count:,} rows')
    except Exception as e:
        print(f'{table}: ERROR - {e}')

📦 Table Sizes
orders: 99,441 rows
order_items: 112,650 rows
customers: 99,441 rows
sellers: 3,095 rows
products: 32,951 rows
order_reviews: 99,224 rows
order_payments: 103,886 rows
geolocation: 1,000,163 rows


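### 📈 Commentary

- **orders (~99K)**: Main analysis table
- **customers (~99K)**: Same row count as orders, because `customer_id` is issued per order. Repeat buyers are identified through `customer_unique_id`, so retention is measured in Step 3 rather than from this count
- **order_items (~113K)**: ~1.1 items per order → small basket size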
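Step 3 of this notebook groups by `customer_unique_id` rather than `customer_id`. The toy example below, with made-up IDs, sketches why the two can differ: it assumes a person who orders twice receives two `customer_id` values but keeps one `customer_unique_id`. It also reproduces the items-per-order ratio from the printed row counts.

```python
from collections import Counter

# Hypothetical rows from a customers-like table: (customer_id, customer_unique_id).
customers = [
    ("c1", "u1"),
    ("c2", "u1"),  # same person, second order -> new customer_id
    ("c3", "u2"),
]
# Hypothetical rows from an orders-like table: (order_id, customer_id).
orders = [("o1", "c1"), ("o2", "c2"), ("o3", "c3")]

# Map each order to the person behind it, then count orders per person.
unique_id_of = dict(customers)
orders_per_person = Counter(unique_id_of[cid] for _, cid in orders)

n_customer_rows = len(customers)      # one row per order
n_people = len(orders_per_person)     # distinct people
repeat_buyers = sum(1 for n in orders_per_person.values() if n > 1)

# Items per order, from the row counts printed by the table-size cell.
items_per_order = 112_650 / 99_441

print(n_customer_rows, n_people, repeat_buyers, round(items_per_order, 2))
```

Under this assumption, the number of rows in `customers` tracks the number of orders, not the number of people.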

---
## 🎯 Step 2: Delivery Time Analysis

Delivery time = the difference, in days, between the purchase timestamp (`order_purchase_timestamp`) and the customer delivery date (`order_delivered_customer_date`)
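The query in the next cell computes this difference in SQL with `JULIANDAY`, which yields fractional days. A rough Python equivalent is sketched below. It assumes the timestamps are stored as `YYYY-MM-DD HH:MM:SS` strings; the `delivery_days` helper and the sample timestamps are hypothetical and not taken from the dataset.

```python
from datetime import datetime

SECONDS_PER_DAY = 86_400


def delivery_days(purchased: str, delivered: str) -> float:
    """Fractional days between two 'YYYY-MM-DD HH:MM:SS' timestamps."""
    fmt = "%Y-%m-%d %H:%M:%S"
    delta = datetime.strptime(delivered, fmt) - datetime.strptime(purchased, fmt)
    return delta.total_seconds() / SECONDS_PER_DAY


# Hypothetical order: bought at noon, delivered ten days and six hours later.
example = delivery_days("2018-01-01 12:00:00", "2018-01-11 18:00:00")
print(example)
```

Keeping the fractional part, rather than rounding to whole days, matches the precision of the statistics reported in this step.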

In [4]:
# Compute delivery time
query = '''
SELECT 
    (JULIANDAY(order_delivered_customer_date) - JULIANDAY(order_purchase_timestamp)) as teslimat_gunu
FROM orders 
WHERE order_delivered_customer_date IS NOT NULL
    AND order_status = 'delivered'
'''

with engine.connect() as conn:
    delivery = pd.read_sql(text(query), conn)

print('📦 Delivery Time Statistics')
print('='*40)
print(f'Mean:   {delivery["teslimat_gunu"].mean():.1f} days')
print(f'Median: {delivery["teslimat_gunu"].median():.1f} days')
print(f'Std:    {delivery["teslimat_gunu"].std():.1f} days')
print(f'Min:    {delivery["teslimat_gunu"].min():.1f} days')
print(f'Max:    {delivery["teslimat_gunu"].max():.1f} days')

# Histogram with the mean marked as a dashed red line
fig = px.histogram(delivery, x='teslimat_gunu', nbins=50,
                   title='Delivery Time Distribution',
                   labels={'teslimat_gunu': 'Delivery Time (Days)'})
fig.add_vline(x=delivery['teslimat_gunu'].mean(), line_dash='dash', 
              line_color='red', annotation_text='Mean')
fig.show()

📦 Delivery Time Statistics
Mean:   12.6 days
Median: 10.2 days
Std:    9.5 days
Min:    0.5 days
Max:    209.6 days


### 📊 Commentary

- **Right-skewed distribution**: Mean (12.6) > median (10.2), because long deliveries pull the mean up
- **High variance**: A standard deviation of 9.5 days makes prediction hard
- **Outliers**: Some deliveries take 50+ days (maximum 209.6)
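These observations tie back to the RMSE target: a model that always predicts the mean has an RMSE equal to the population standard deviation of the data. The sketch below illustrates this on a small, made-up right-skewed sample; the numbers are hypothetical and only meant to show the relationship.

```python
import math
import statistics

# Hypothetical delivery times in days (right-skewed: one long delay).
days = [4.0, 6.0, 8.0, 10.0, 12.0, 50.0]

mean_days = statistics.fmean(days)
median_days = statistics.median(days)


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Baseline model: predict the sample mean for every order.
baseline_rmse = rmse(days, [mean_days] * len(days))
population_std = statistics.pstdev(days)

print(mean_days, median_days, baseline_rmse, population_std)
```

pandas' `Series.std()` uses the sample formula (`ddof=1`), which is practically identical to the population value at this dataset's size. A mean-only predictor would therefore score an RMSE of roughly 9.5 days here, so a target of RMSE < 8 days requires a model that beats that baseline.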

---
## 👥 Step 3: Customer Behavior

Orders per customer and retention analysis

In [5]:
# Orders per customer (delivered orders only, grouped by the person-level ID)
query = '''
SELECT c.customer_unique_id, COUNT(DISTINCT o.order_id) as order_count
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_status = 'delivered'
GROUP BY c.customer_unique_id
'''

with engine.connect() as conn:
    customer_orders = pd.read_sql(text(query), conn)

print('👥 Customer Order Distribution')
print('='*40)
freq = customer_orders['order_count'].value_counts().sort_index()
for count in range(1, min(6, len(freq)+1)):
    if count in freq.index:
        pct = freq[count] / len(customer_orders) * 100
        print(f'{count} order(s): {freq[count]:,} customers ({pct:.1f}%)')

tek_siparisli = (customer_orders['order_count'] == 1).mean() * 100
print(f'\n⚠️ One-time customer rate: {tek_siparisli:.1f}%')

# Pie chart
fig = px.pie(values=[tek_siparisli, 100-tek_siparisli],
             names=['One-Time', 'Repeat'],
             title='Customer Retention Status',
             color_discrete_sequence=['#FF6B6B', '#4ECDC4'])
fig.show()

👥 Customer Order Distribution
1 order(s): 90,557 customers (97.0%)
2 order(s): 2,573 customers (2.8%)
3 order(s): 181 customers (0.2%)
4 order(s): 28 customers (0.0%)
5 order(s): 9 customers (0.0%)

⚠️ One-time customer rate: 97.0%


### 👥 Commentary

| Metric | Value | Status |
|--------|-------|--------|
| One-time customers | ~97% | 🔴 Critical |

**Conclusion**: Serious retention problem. A churn model will be developed in NB3.
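As a quick consistency check, the one-time share can be recomputed from the five frequency rows printed above. This is only a sketch over the displayed rows: customers with six or more orders are not printed, so the total used here is a lower bound on the real customer count.

```python
# Frequency rows printed by the cell above: orders -> number of customers.
# Customers with 6+ orders are not printed, so the total is a lower bound.
shown = {1: 90_557, 2: 2_573, 3: 181, 4: 28, 5: 9}

total_shown = sum(shown.values())
one_time_pct = 100 * shown[1] / total_shown
repeat_pct = 100 - one_time_pct

print(total_shown, round(one_time_pct, 1), round(repeat_pct, 1))
```

The result agrees with the reported one-time rate to one decimal place, which suggests the hidden tail of very frequent buyers is tiny.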

---
## ⭐ Step 4: Review Score Analysis

A direct indicator of customer satisfaction

In [6]:
# Review score distribution
query = '''
SELECT review_score, COUNT(*) as count
FROM order_reviews
GROUP BY review_score
ORDER BY review_score
'''

with engine.connect() as conn:
    reviews = pd.read_sql(text(query), conn)

print('⭐ Review Score Distribution')
print('='*40)
total = reviews['count'].sum()
for _, row in reviews.iterrows():
    pct = row['count'] / total * 100
    bar = '█' * int(pct/2)  # one block per 2 percentage points
    print(f'{int(row["review_score"])}-star: {bar} {pct:.1f}%')

# Share of reviews with a score of 1 or 2
dusuk = reviews[reviews['review_score'] <= 2]['count'].sum() / total * 100
print(f'\n🔴 Low review rate (1-2): {dusuk:.1f}%')

fig = px.bar(reviews, x='review_score', y='count',
             title='Review Score Distribution',
             labels={'review_score': 'Score', 'count': 'Count'})
fig.show()

⭐ Review Score Distribution
1-star: █████ 11.5%
2-star: █ 3.2%
3-star: ████ 8.2%
4-star: █████████ 19.3%
5-star: ████████████████████████████ 57.8%

🔴 Low review rate (1-2): 14.7%


### ⭐ Commentary

- 5 stars dominate (~58%): positive
- 1 star at ~11.5%: critical segment
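The printed percentages are enough to derive two summary figures: the average score and the combined share of 1-star and 2-star reviews. The sketch below works from the rounded values in the output above, so its results are approximate.

```python
# Percentages printed by the cell above: score -> share of reviews (%).
share = {1: 11.5, 2: 3.2, 3: 8.2, 4: 19.3, 5: 57.8}

total = sum(share.values())

# Weighted mean of the scores, using the shares as weights.
mean_score = sum(score * pct for score, pct in share.items()) / total

# Combined share of 1-star and 2-star reviews.
low_share = share[1] + share[2]

print(round(mean_score, 2), round(low_share, 1))
```

The average lands just above 4 stars even though more than one review in seven is a 1 or 2, which is why the low-score share is tracked separately from the mean.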

---
## 🗺️ Step 5: Geographic Analysis

Delivery performance across different regions of Brazil

In [7]:
# State-level analysis: order volume and average delivery time
query = '''
SELECT 
    c.customer_state as state,
    COUNT(DISTINCT o.order_id) as orders,
    AVG((JULIANDAY(o.order_delivered_customer_date) - JULIANDAY(o.order_purchase_timestamp))) as avg_delivery
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_delivered_customer_date IS NOT NULL
GROUP BY c.customer_state
ORDER BY orders DESC
LIMIT 10
'''

with engine.connect() as conn:
    geo = pd.read_sql(text(query), conn)

print('🗺️ Top 10 States')
print('='*50)
for _, row in geo.iterrows():
    print(f"{row['state']}: {int(row['orders']):,} orders, {row['avg_delivery']:.1f} days")

fig = px.bar(geo, x='state', y='orders', color='avg_delivery',
             title='Orders and Delivery Time by State',
             labels={'state': 'State', 'orders': 'Orders', 'avg_delivery': 'Avg. Delivery'})
fig.show()

🗺️ Top 10 States
SP: 40,495 orders, 8.8 days
RJ: 12,353 orders, 15.3 days
MG: 11,355 orders, 12.0 days
RS: 5,344 orders, 15.3 days
PR: 4,923 orders, 12.0 days
SC: 3,547 orders, 15.0 days
BA: 3,256 orders, 19.3 days
DF: 2,080 orders, 13.0 days
ES: 1,995 orders, 15.8 days
GO: 1,957 orders, 15.6 days


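### 🗺️ Commentary

- **SP (São Paulo)**: Highest volume and fastest delivery (8.8 days)
- **Other states**: Lower volume and slower delivery, up to 19.3 days in BA. The northern states fall outside the top 10 by volume and are not shown here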
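To see how strongly SP dominates, the printed rows can be combined into an order-weighted average. The sketch below covers only the ten states listed above and uses the rounded averages from the output, so it is an approximation rather than a dataset-wide figure.

```python
# (state, orders, avg_delivery_days) rows printed by the cell above.
top10 = [
    ("SP", 40_495, 8.8),
    ("RJ", 12_353, 15.3),
    ("MG", 11_355, 12.0),
    ("RS", 5_344, 15.3),
    ("PR", 4_923, 12.0),
    ("SC", 3_547, 15.0),
    ("BA", 3_256, 19.3),
    ("DF", 2_080, 13.0),
    ("ES", 1_995, 15.8),
    ("GO", 1_957, 15.6),
]

total_orders = sum(n for _, n, _ in top10)

# Average delivery time weighted by each state's order count.
weighted_avg = sum(n * d for _, n, d in top10) / total_orders

# SP's share of the orders in these ten states, and the slowest state.
sp_share = 100 * top10[0][1] / total_orders
slowest = max(top10, key=lambda row: row[2])

print(total_orders, round(weighted_avg, 1), round(sp_share, 1), slowest[0])
```

SP accounts for close to half of these orders, so its fast deliveries pull the weighted average to about 11.8 days, below the average of every other state in the list.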

---
## 🔗 Where Are This Notebook's Outputs Used?

| Output | Used In | Description |
|--------|---------|-------------|
| Mean delivery time (12.6 days) | NB2 baseline | Model improvement target |
| 97% one-time customers | NB3 churn definition | Problem motivation |
| Review distribution | Dashboard | Satisfaction metrics |
| Geographic findings | NB2 features | distance_km, same_state |

> 📌 **Link:** These analyses shape the model features in NB2 and the churn definition in NB3.


---
## 📋 Conclusion

### 🔍 Key Findings

| # | Finding | Priority |
|---|---------|----------|
| 1 | 97% one-time customers | 🔴 Critical |
| 2 | 12.6-day mean delivery time | 🔴 Critical |
| 3 | 14.7% low reviews (1-2 stars) | 🟡 High |
| 4 | Geographic inequality | 🟡 High |

### 🎯 Next Steps

- **NB2**: Delivery time prediction model
- **NB3**: Customer churn prediction
- **NB4**: Customer segmentation
- **NB5**: Final evaluation

---

> 📌 **EDA complete.** You can move on to the next notebook.