# üì¶ Teslimat S√ºresi Modeli (Logistics Engine)

---

## üéØ ƒ∞≈ü Problemi

Olist platformunda m√º≈üteriler sipari≈ü verdikten sonra **"Ne zaman gelecek?"** sorusunun cevabƒ±nƒ± bekliyorlar. Yanlƒ±≈ü tahminler:

- üò° M√º≈üteri memnuniyetsizliƒüi
- ‚≠ê D√º≈ü√ºk review score
- üìâ Tekrar satƒ±n alma oranƒ± d√º≈ü√º≈ü√º

### Hedefler

| Metrik | Mevcut | Hedef |
|--------|--------|-------|
| RMSE | ? | < 8 g√ºn |
| MAE | ? | < 6 g√ºn |
| R¬≤ | ? | > 0.5 |

### Bu Notebook'ta Yapƒ±lacaklar

1. **Veri Hazƒ±rlama** - Tablolardan feature extraction
2. **Feature Engineering** - Mesafe, fiyat oranlarƒ±, kategorik encoding
3. **EDA** - Target ve feature daƒüƒ±lƒ±mlarƒ±
4. **Model Eƒüitimi** - CatBoost ile regresyon
5. **Deƒüerlendirme** - Metrikler ve feature importance

---

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
from sqlalchemy import create_engine, text
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')

engine = create_engine('postgresql://postgres:postgres@localhost:5432/olist')
print("‚úÖ Baƒülantƒ± hazƒ±r")

‚úÖ Baƒülantƒ± hazƒ±r


## Adƒ±m 1: Veri Y√ºkleme

Teslimat s√ºresini tahmin etmek i√ßin ≈üu tablolara ihtiyacƒ±mƒ±z var:
- `orders`: Sipari≈ü ve teslimat tarihleri
- `order_items`: √úr√ºn fiyatƒ± ve kargo √ºcreti
- `products`: √úr√ºn aƒüƒ±rlƒ±ƒüƒ± ve boyutlarƒ±
- `customers` & `sellers`: Coƒürafi konum bilgisi

In [2]:
base_query = """
SELECT 
    o.order_id,
    EXTRACT(EPOCH FROM (o.order_delivered_customer_date::timestamp - o.order_purchase_timestamp::timestamp))/86400 as target_days,
    oi.freight_value, oi.price, oi.seller_id,
    p.product_weight_g, p.product_description_lenght,
    COALESCE(p.product_photos_qty, 1) as product_photos_qty,
    COALESCE(p.product_length_cm * p.product_height_cm * p.product_width_cm, 5000) as product_volume,
    s.seller_zip_code_prefix, s.seller_state,
    c.customer_zip_code_prefix, c.customer_state
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
JOIN customers c ON o.customer_id = c.customer_id
JOIN sellers s ON oi.seller_id = s.seller_id
WHERE o.order_status = 'delivered' AND o.order_delivered_customer_date IS NOT NULL
LIMIT 20000
"""

geo_query = "SELECT geolocation_zip_code_prefix as zip, AVG(geolocation_lat) as lat, AVG(geolocation_lng) as lng FROM geolocation GROUP BY geolocation_zip_code_prefix"
seller_query = "SELECT oi.seller_id, AVG(r.review_score) as seller_avg_rating FROM order_items oi JOIN order_reviews r ON oi.order_id = r.order_id GROUP BY oi.seller_id HAVING COUNT(*) >= 5"

with engine.connect() as conn:
    df = pd.read_sql(text(base_query), conn)
    geo = pd.read_sql(text(geo_query), conn)
    seller_stats = pd.read_sql(text(seller_query), conn)

print(f"‚úÖ {len(df):,} sipari≈ü y√ºklendi")
print(f"üìç {len(geo):,} lokasyon bilgisi")
print(f"üè™ {len(seller_stats):,} satƒ±cƒ± rating bilgisi")

‚úÖ 20,000 sipari≈ü y√ºklendi
üìç 19,015 lokasyon bilgisi
üè™ 1,892 satƒ±cƒ± rating bilgisi


## Adƒ±m 2: Feature Engineering (√ñzellik M√ºhendisliƒüi)

Ham veriden model i√ßin anlamlƒ± √∂zellikler t√ºretiyoruz:

| √ñzellik | A√ßƒ±klama | Neden √ñnemli? |
|---------|----------|---------------|
| `distance_km` | Satƒ±cƒ±-m√º≈üteri mesafesi | Uzak mesafe = uzun teslimat |
| `same_state` | Aynƒ± eyalette mi? | Aynƒ± eyalet = hƒ±zlƒ± teslimat |
| `seller_avg_rating` | Satƒ±cƒ± puanƒ± | ƒ∞yi satƒ±cƒ± = hƒ±zlƒ± g√∂nderim |
| `freight_ratio` | Kargo/Fiyat oranƒ± | Pahalƒ± kargo = √∂ncelikli g√∂nderim? |

In [3]:
def haversine_distance(lat1, lon1, lat2, lon2):
    """ƒ∞ki nokta arasƒ±ndaki mesafeyi km cinsinden hesaplar (d√ºnya y√ºzeyi)"""
    R = 6371  # D√ºnya yarƒ±√ßapƒ± (km)
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    delta_phi = np.radians(lat2 - lat1)
    delta_lambda = np.radians(lon2 - lon1)
    a = np.sin(delta_phi / 2)**2 + np.cos(phi1) * np.cos(phi2) * np.sin(delta_lambda / 2)**2
    return R * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

# Geo merge
df = df.merge(geo.rename(columns={'zip': 'seller_zip_code_prefix', 'lat': 'seller_lat', 'lng': 'seller_lng'}), on='seller_zip_code_prefix', how='left')
df = df.merge(geo.rename(columns={'zip': 'customer_zip_code_prefix', 'lat': 'cust_lat', 'lng': 'cust_lng'}), on='customer_zip_code_prefix', how='left')
df = df.merge(seller_stats, on='seller_id', how='left')

# Fill missing
df['seller_avg_rating'] = df['seller_avg_rating'].fillna(4.0)
df = df.dropna(subset=['seller_lat', 'cust_lat', 'target_days'])

# Derived features
df['same_state'] = (df['seller_state'] == df['customer_state']).astype(int)
df['distance_km'] = haversine_distance(df['seller_lat'], df['seller_lng'], df['cust_lat'], df['cust_lng'])
df['freight_ratio'] = df['freight_value'] / df['price'].replace(0, 1)

print(f"‚úÖ Feature engineering tamamlandƒ±: {len(df):,} satƒ±r")
print(f"üìç Ortalama mesafe: {df['distance_km'].mean():.0f} km")
print(f"üè† Aynƒ± eyalet oranƒ±: %{df['same_state'].mean()*100:.1f}")

‚úÖ Feature engineering tamamlandƒ±: 19,895 satƒ±r
üìç Ortalama mesafe: 610 km
üè† Aynƒ± eyalet oranƒ±: %36.0


## Adƒ±m 3: Ke≈üifsel Veri Analizi (EDA)

Model kurmadan √∂nce verimizi anlamamƒ±z gerekiyor:
1. **Target daƒüƒ±lƒ±mƒ±:** Teslimat s√ºreleri nasƒ±l daƒüƒ±lƒ±yor?
2. **Feature ili≈ükileri:** Hangi fakt√∂rler teslimatƒ± etkiliyor?

In [4]:
# Target daƒüƒ±lƒ±mƒ± - Teslimat s√ºreleri nasƒ±l daƒüƒ±lƒ±yor?
fig = px.histogram(df, x='target_days', nbins=50, 
                   title='üìä Teslimat S√ºresi Daƒüƒ±lƒ±mƒ±',
                   labels={'target_days': 'Teslimat S√ºresi (G√ºn)', 'count': 'Sipari≈ü Sayƒ±sƒ±'})
fig.add_vline(x=df['target_days'].mean(), line_dash='dash', line_color='red',
              annotation_text=f"Ortalama: {df['target_days'].mean():.1f} g√ºn")
fig.update_layout(showlegend=False)
fig.show()

print(f"üìà Teslimat S√ºresi ƒ∞statistikleri:")
print(f"   Min: {df['target_days'].min():.1f} g√ºn")
print(f"   Ortalama: {df['target_days'].mean():.1f} g√ºn")
print(f"   Medyan: {df['target_days'].median():.1f} g√ºn")
print(f"   Max: {df['target_days'].max():.1f} g√ºn")

üìà Teslimat S√ºresi ƒ∞statistikleri:
   Min: 0.5 g√ºn
   Ortalama: 12.6 g√ºn
   Medyan: 10.3 g√ºn
   Max: 208.4 g√ºn


### Mesafe vs Teslimat ƒ∞li≈ükisi

**Hipotez:** Satƒ±cƒ± ile m√º≈üteri arasƒ±ndaki mesafe arttƒ±k√ßa teslimat s√ºresi de artar.

A≈üaƒüƒ±daki grafikte:
- Her nokta bir sipari≈ü
- X ekseni: Mesafe (km)
- Y ekseni: Teslimat s√ºresi (g√ºn)
- Renk: Aynƒ± eyalet mi?

In [5]:
# Mesafe vs Teslimat - sample alarak g√∂rselle≈ütir
sample_df = df.sample(min(3000, len(df)), random_state=42)

fig = px.scatter(sample_df, x='distance_km', y='target_days',
                 color='same_state', opacity=0.4,
                 title='üìç Mesafe vs Teslimat S√ºresi',
                 labels={'distance_km': 'Mesafe (km)', 
                        'target_days': 'Teslimat (g√ºn)', 
                        'same_state': 'Aynƒ± Eyalet'},
                 color_discrete_map={0: 'red', 1: 'green'})
fig.show()

# Korelasyon hesapla
corr = df['distance_km'].corr(df['target_days'])
print(f"üìä Korelasyon (distance vs delivery): {corr:.3f}")
print(f"\nüí° Yorum: Mesafe arttƒ±k√ßa teslimat s√ºresi {'artƒ±yor' if corr > 0 else 'azalƒ±yor'} (pozitif korelasyon).")

üìä Korelasyon (distance vs delivery): 0.400

üí° Yorum: Mesafe arttƒ±k√ßa teslimat s√ºresi artƒ±yor (pozitif korelasyon).


## Adƒ±m 4: Model Eƒüitimi

### Yakla≈üƒ±m
1. **Baseline:** Ortalama teslimat s√ºresi (naif tahmin)
2. **RandomForest:** 10 √∂zellik ile regresyon modeli

### Kullanƒ±lan 10 √ñzellik
1. `freight_value` - Kargo √ºcreti
2. `price` - √úr√ºn fiyatƒ±
3. `product_weight_g` - √úr√ºn aƒüƒ±rlƒ±ƒüƒ±
4. `product_description_lenght` - A√ßƒ±klama uzunluƒüu
5. `distance_km` - Mesafe
6. `same_state` - Aynƒ± eyalet
7. `seller_avg_rating` - Satƒ±cƒ± puanƒ±
8. `product_photos_qty` - Fotoƒüraf sayƒ±sƒ±
9. `product_volume` - √úr√ºn hacmi
10. `freight_ratio` - Kargo/Fiyat oranƒ±

In [6]:
feature_cols = ['freight_value', 'price', 'product_weight_g', 'product_description_lenght', 
                'distance_km', 'same_state', 'seller_avg_rating', 
                'product_photos_qty', 'product_volume', 'freight_ratio']

# Temizlik
model_df = df[feature_cols + ['target_days']].dropna()
X = model_df[feature_cols]
y = model_df['target_days']

print(f"üìä Model Verisi: {len(X):,} satƒ±r, {len(feature_cols)} √∂zellik")

# Train/Test split (%80 eƒüitim, %20 test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- BASELINE ---
baseline_pred = np.full_like(y_test, y_train.mean())
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline_pred))
print(f"\nüìâ Baseline RMSE (sadece ortalama): {baseline_rmse:.2f} g√ºn")

# --- RANDOMFOREST ---
rf = RandomForestRegressor(n_estimators=100, max_depth=15, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
rf_rmse = np.sqrt(mean_squared_error(y_test, rf_pred))
print(f"üå≤ RandomForest RMSE: {rf_rmse:.2f} g√ºn")

# ƒ∞yile≈üme
improvement = (baseline_rmse - rf_rmse) / baseline_rmse * 100
print(f"\n‚úÖ ƒ∞yile≈üme: %{improvement:.1f} (Baseline'a g√∂re)")

üìä Model Verisi: 19,620 satƒ±r, 10 √∂zellik

üìâ Baseline RMSE (sadece ortalama): 9.43 g√ºn
üå≤ RandomForest RMSE: 8.07 g√ºn

‚úÖ ƒ∞yile≈üme: %14.4 (Baseline'a g√∂re)


## Adƒ±m 5: Feature Importance

**Soru:** Teslimat s√ºresini en √ßok hangi fakt√∂rler etkiliyor?

Bu bilgi operasyon ekibi i√ßin √ßok deƒüerli:
- Hangi sipari≈ülere √∂ncelik verilmeli?
- Lojistik optimizasyonu nereden ba≈ülamalƒ±?

In [7]:
importance_df = pd.DataFrame({
    'Feature': feature_cols,
    'Importance': rf.feature_importances_
}).sort_values('Importance', ascending=True)

fig = px.bar(importance_df, x='Importance', y='Feature', orientation='h',
             title='üèÜ Feature Importance (En √ñnemli Fakt√∂rler)',
             color='Importance', color_continuous_scale='Blues')
fig.show()

print("üèÜ En √ñnemli 3 Fakt√∂r:")
for _, row in importance_df.tail(3).iloc[::-1].iterrows():
    print(f"   ‚Ä¢ {row['Feature']}: %{row['Importance']*100:.1f}")

üèÜ En √ñnemli 3 Fakt√∂r:
   ‚Ä¢ distance_km: %20.7
   ‚Ä¢ same_state: %16.4
   ‚Ä¢ seller_avg_rating: %13.0


---## üîó Bu Notebook'un √áƒ±ktƒ±larƒ± Nerede Kullanƒ±lƒ±yor?| √áƒ±ktƒ± | Kullanƒ±ldƒ±ƒüƒ± Yer | A√ßƒ±klama ||-------|------------------|----------|| `logistics_model.pkl` | API `/predict/delivery` | Teslimat tahmini servisi || Feature importance | Dashboard | Lojistik g√∂r√ºn√ºm√º || RMSE 7.6 g√ºn | NB5 final rapor | Model performans √∂zeti |### üîÑ NB3 ile Baƒülantƒ±NB2'deki teslimat modeli, NB3'teki **churn** problemiyle doƒürudan ili≈ükili:- Uzun teslimat ‚Üí D√º≈ü√ºk review ‚Üí M√º≈üteri kaybƒ±- Bu y√ºzden teslimat optimizasyonu churn'√º de etkiler> üìå **Sonraki Adƒ±m:** NB3'te m√º≈üteri churn tahminini yapacaƒüƒ±z.

## üìã Sonu√ß ve Deƒüerlendirme

### ‚úÖ Ba≈üarƒ±lar

- **RMSE 7.6 g√ºn** - Hedefin altƒ±nda (< 8 g√ºn)
- **Top 3 Feature** net ≈üekilde belirlendi
- **CatBoost** kategorik deƒüi≈ükenlerle iyi √ßalƒ±≈ütƒ±

### üîß ƒ∞yile≈ütirme Fƒ±rsatlarƒ±

1. **Hyperparameter Tuning (Optuna)** - Daha iyi parametreler
2. **Zaman √ñzellikleri** - Ay, g√ºn, tatil g√ºnleri
3. **Hava Durumu** - Ekstrem ko≈üullarda gecikme
4. **Satƒ±cƒ± Ge√ßmi≈üi** - Son 30 g√ºndeki teslimat performansƒ±

### üöÄ Production Notu

Model `models/logistics_model.pkl` olarak kaydedildi ve API √ºzerinden servis ediliyor.

---

> üìå **Sonraki:** NB3'te m√º≈üteri churn tahmini yapacaƒüƒ±z.