# 📊 Module 1: The Big Picture (EDA & Data Mart)

**Goal:** Analyze and clean the 9 tables in the Olist database and combine them into a single, modeling-ready 'Master Table' (Data Mart).

**Steps:**
1.  **Data Connection:** Pull the data from PostgreSQL.
2.  **Quality Check:** Missing-value and data-type analysis.
3.  **Visualization:** Order distributions and customer locations.
4.  **Merging:** `Orders` + `Items` + `Customers` + `Products` -> `Master_Data`.

In [17]:
import sys
import os

# We connect from the local machine, not from inside Docker, so the host is set to 'localhost'
os.environ["POSTGRES_HOST"] = "localhost"

import pandas as pd
import polars as pl
import plotly.express as px
from sqlalchemy import create_engine, text

# Add the project root to the path (so the src module can be found)
sys.path.append(os.path.abspath('..'))

from src.config import DATABASE_URL

print("✅ Libraries loaded and configuration read.")

✅ Libraries loaded and configuration read.
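
The cell above sets `POSTGRES_HOST` before importing `src.config`, which suggests that the config module builds `DATABASE_URL` from environment variables at import time. The notebook does not show that module, so the following is only a minimal sketch of what it might look like. The helper name `build_database_url`, every variable name other than `POSTGRES_HOST`, and all default values are assumptions.

```python
import os
from urllib.parse import quote_plus


def build_database_url(env=None):
    """Assemble a SQLAlchemy-style PostgreSQL URL from environment variables.

    Hypothetical helper: only POSTGRES_HOST appears in the notebook; the
    other variable names and every default below are assumptions.
    """
    env = os.environ if env is None else env
    user = env.get("POSTGRES_USER", "postgres")
    # Percent-encode the password so characters like '@' do not break the URL.
    password = quote_plus(env.get("POSTGRES_PASSWORD", ""))
    host = env.get("POSTGRES_HOST", "db")  # assumed Docker service name
    port = env.get("POSTGRES_PORT", "5432")
    name = env.get("POSTGRES_DB", "olist")
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"


# Overriding the host, as the notebook does, changes only that URL component.
print(build_database_url({"POSTGRES_HOST": "localhost"}))
```

Under this assumption the order of statements matters: the environment variable has to be set before the config module is imported, which is why the notebook assigns `os.environ["POSTGRES_HOST"]` first.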


In [18]:
# Database connection
engine = create_engine(DATABASE_URL)

def get_table_info():
    query = """
    SELECT table_name, 
           (SELECT COUNT(*) FROM information_schema.columns WHERE table_name=t.table_name) as columns
    FROM information_schema.tables t 
    WHERE table_schema = 'public';
    """
    with engine.connect() as conn:
        df = pd.read_sql(query, conn)
    return df

print("Tables in the database:")
get_table_info()

Tables in the database:


| | table_name | columns |
|---|---|---|
| 0 | sellers | 4 |
| 1 | order_reviews | 7 |
| 2 | product_category_name_translation | 2 |
| 3 | products | 9 |
| 4 | orders | 8 |
| 5 | order_items | 7 |
| 6 | customers | 5 |
| 7 | geolocation | 5 |
| 8 | order_payments | 5 |


## 🚀 Step 1: Reading and Merging the Data (Data Mart)

To run the analysis, we need to combine the fragmented tables (Orders, Items, Products, Customers) into a single **Master Table**.
We will use the `Polars` library for this (it is usually considerably faster than pandas for joins and aggregations like these).

In [19]:
# Read the tables with Polars
print("⏳ Reading tables...")

df_orders = pl.read_database("SELECT * FROM orders", connection=engine)
df_items = pl.read_database("SELECT * FROM order_items", connection=engine)
df_products = pl.read_database("SELECT * FROM products", connection=engine)
df_customers = pl.read_database("SELECT * FROM customers", connection=engine)

print(f"✅ Data loaded:\n- Orders: {df_orders.height} rows\n- Items: {df_items.height} rows\n- Products: {df_products.height} rows\n- Customers: {df_customers.height} rows")

⏳ Reading tables...
✅ Data loaded:
- Orders: 99441 rows
- Items: 112650 rows
- Products: 32951 rows
- Customers: 99441 rows


In [20]:
# Merging the data (joins)
# Logic: Order Items (the most granular table) -> Orders -> Products -> Customers

master_df = (
    df_items.join(df_orders, on="order_id", how="left")
    .join(df_products, on="product_id", how="left")
    .join(df_customers, on="customer_id", how="left")
)

print(f"🧩 Master table created: {master_df.height} rows, {master_df.width} columns")
master_df.head()

🧩 Master table created: 112650 rows, 26 columns


order_id,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date,product_category_name,product_name_lenght,product_description_lenght,product_photos_qty,product_weight_g,product_length_cm,product_height_cm,product_width_cm,customer_unique_id,customer_zip_code_prefix,customer_city,customer_state
str,i64,str,str,str,f64,f64,str,str,str,str,str,str,str,str,i64,i64,i64,i64,i64,i64,i64,str,i64,str,str
"00010242fe8c5a6d1ba2dd792cb162…",1,"4244733e06e7ecb4970a6e2683c13e…","48436dade18ac8b2bce089ec2a0412…","2017-09-19 09:45:35",58.9,13.29,"3ce436f183e68e07877b285a838db1…","delivered","2017-09-13 08:59:02","2017-09-13 09:45:35","2017-09-19 18:34:16","2017-09-20 23:43:48","2017-09-29 00:00:00","cool_stuff",58,598,4,650,28,9,14,"871766c5855e863f6eccc05f988b23…",28013,"campos dos goytacazes","RJ"
"00018f77f2f0320c557190d7a144bd…",1,"e5f2d52b802189ee658865ca93d83a…","dd7ddc04e1b6c2c614352b383efe2d…","2017-05-03 11:05:13",239.9,19.93,"f6dd3ec061db4e3987629fe6b26e5c…","delivered","2017-04-26 10:53:06","2017-04-26 11:05:13","2017-05-04 14:35:00","2017-05-12 16:04:24","2017-05-15 00:00:00","pet_shop",56,239,2,30000,50,30,40,"eb28e67c4c0b83846050ddfb8a35d0…",15775,"santa fe do sul","SP"
"000229ec398224ef6ca0657da4fc70…",1,"c777355d18b72b67abbeef9df44fd0…","5b51032eddd242adc84c38acab88f2…","2018-01-18 14:48:30",199.0,17.87,"6489ae5e4333f3693df5ad4372dab6…","delivered","2018-01-14 14:33:31","2018-01-14 14:48:30","2018-01-16 12:36:48","2018-01-22 13:19:16","2018-02-05 00:00:00","moveis_decoracao",59,695,2,3050,33,13,33,"3818d81c6709e39d06b2738a8d3a24…",35661,"para de minas","MG"
"00024acbcdf0a6daa1e931b038114c…",1,"7634da152a4610f1595efa32f14722…","9d7a1d34a5052409006425275ba1c2…","2018-08-15 10:10:18",12.99,12.79,"d4eb9395c8c0431ee92fce09860c5a…","delivered","2018-08-08 10:00:35","2018-08-08 10:10:18","2018-08-10 13:28:00","2018-08-14 13:32:39","2018-08-20 00:00:00","perfumaria",42,480,1,200,16,10,15,"af861d436cfc08b2c2ddefd0ba0746…",12952,"atibaia","SP"
"00042b26cf59d7ce69dfabb4e55b4f…",1,"ac6c3623068f30de03045865e4e100…","df560393f3a51e74553ab94004ba5c…","2017-02-13 13:57:51",199.9,18.14,"58dbd0b2d70206bf40e62cd34e84d7…","delivered","2017-02-04 13:57:51","2017-02-04 14:10:13","2017-02-16 09:46:09","2017-03-01 16:42:31","2017-03-17 00:00:00","ferramentas_jardim",59,409,1,3750,35,40,30,"64b576fb70d441e8f1b2d7d446e483…",13226,"varzea paulista","SP"


## 🔍 Step 2: EDA and Visualization

The data is now merged. Next, we explore it with **Plotly**.
The questions we want to answer:
1.  Are there missing values in the data?
2.  What is the status of the orders? (Delivered, canceled, etc.)
3.  How have sales changed over time?
4.  Which category sells the most products?

In [21]:
# 1. Missing Data Analysis and Strategy
null_counts = master_df.null_count()
print("Missing value counts:")
for col in null_counts.columns:
    count = null_counts[col][0]
    if count > 0:
        print(f"❌ {col}: {count} missing")


Missing value counts:
❌ order_approved_at: 15 missing
❌ order_delivered_carrier_date: 1194 missing
❌ order_delivered_customer_date: 2454 missing
❌ product_category_name: 1603 missing
❌ product_name_lenght: 1603 missing
❌ product_description_lenght: 1603 missing
❌ product_photos_qty: 1603 missing
❌ product_weight_g: 18 missing
❌ product_length_cm: 18 missing
❌ product_height_cm: 18 missing
❌ product_width_cm: 18 missing
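
Raw counts are easier to judge as a share of the 112650 rows in the master table. A small sketch using three of the counts printed above (the numbers are copied from the output; the variable names are illustrative):

```python
# Row total and null counts copied from the notebook output above.
TOTAL_ROWS = 112650
null_counts = {
    "order_delivered_customer_date": 2454,
    "product_category_name": 1603,
    "product_weight_g": 18,
}

# Percentage of master-table rows with a missing value, per column.
shares = {col: missing / TOTAL_ROWS * 100 for col, missing in null_counts.items()}

for column, share in shares.items():
    print(f"{column}: {null_counts[column]} missing ({share:.2f}%)")
```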


### 🛠️ Missing Data Strategy (Decision Point)

1.  **Delivery Dates (2454 missing):** These rows belong to orders that have not been delivered yet or were canceled. The **Logistics Module** will use only delivered (`delivered`) orders, so these gaps are not a problem (they will be filtered out).
2.  **Product Information (1603 missing):** Rows whose product has no category, name length, description length, or photo count. For the **Growth Module** we will either fill these with "Unknown" or exclude them from the analysis.
3.  **Weight and Dimensions (18 missing):** There are so few of these rows that we can drop them from the training set.

In [22]:
# 2. Order Status (Bar Chart, final fix)
# Note: master_df has one row per order item, so these are item-level counts.
status_counts = (
    master_df.group_by("order_status")
    .len()
    .sort("len", descending=True)
    .to_pandas()
)

fig = px.bar(status_counts, x='len', y='order_status',
             orientation='h',
             title='Order Status Distribution (Logarithmic)',
             text='len',
             color='order_status')

# Logarithmic axis settings
# range=[0, 5.5] -> 10^5.5 ≈ 316,000 (enough room for the text labels to fit)
# dtick=1 -> show only whole powers of 10 (10, 100, 1k...)
fig.update_xaxes(type="log", title="Order Count (Log)", range=[0, 5.5], dtick=1)
fig.update_yaxes(title="Status")

fig.update_layout(showlegend=False, margin=dict(r=100))
fig.update_traces(textposition='outside')
fig.show()
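
On a Plotly log axis the `range` values are exponents of ten, not raw counts, which is why `[0, 5.5]` spans 1 to roughly 316,000. A small sketch of how such a range could be derived from the largest bar instead of being hard-coded; the helper `log_axis_range` and its round-up-to-half-a-decade rule are assumptions chosen to reproduce the value used above.

```python
import math


def log_axis_range(max_count: int) -> list[float]:
    """Return [low, high] exponents for a log10 axis that starts at 10^0.

    The upper bound is log10(max_count) rounded up to the next half decade,
    which leaves room for text labels placed outside the bars.
    """
    high = math.ceil(math.log10(max_count) * 2) / 2
    return [0.0, high]


# The master table has 112650 rows, an upper bound for any single status count.
axis_range = log_axis_range(112650)
upper_limit = round(10 ** axis_range[1])
print(axis_range, upper_limit)
```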

In [23]:
# 3. Monthly Sales Trend
# Note: pl.len() counts rows (order items); use pl.col("order_id").n_unique() to count distinct orders.
monthly_sales = (
    master_df.with_columns(pl.col("order_purchase_timestamp").str.to_datetime())
    .sort("order_purchase_timestamp")
    .group_by_dynamic("order_purchase_timestamp", every="1mo")
    .agg(pl.len().alias("order_count"))
    .with_columns(pl.col("order_purchase_timestamp").dt.strftime("%Y-%m").alias("month_year"))
    .to_pandas()
)

fig = px.line(monthly_sales, x='month_year', y='order_count',
              title='Monthly Order Trend',
              markers=True)

fig.update_xaxes(title='Month', tickangle=-45)
fig.update_yaxes(title='Number of Orders')
fig.show()

In [24]:
# 4. Top-Selling Categories
top_categories = (
    master_df.group_by("product_category_name")
    .len()
    .sort("len", descending=True)
    .head(10)
    .to_pandas()
)

max_val = top_categories['len'].max()

fig = px.bar(top_categories, x='product_category_name', y='len',
             title='Top 10 Best-Selling Product Categories',
             text_auto='.2s')

fig.update_traces(marker_color='rgb(55, 83, 109)', textposition='outside')
fig.update_layout(xaxis_title="Category",
                  yaxis_title="Units Sold",
                  yaxis_range=[0, max_val * 1.2],
                  margin=dict(t=50, b=100))
fig.show()

## 🏁 Section Summary and Next Steps

In this section we explored the data and decided how to handle its gaps:
1.  **Data Integrity:** About 97% of orders have been delivered (`delivered`). Cancellation rates are low.
2.  **Missing Data:** Rows with missing delivery dates (orders not yet delivered) will be filtered out in the Logistics module.
3.  **Categories:** 'Bed_Bath_Table' (`cama_mesa_banho`) and 'Health_Beauty' (`beleza_saude`) are the best-selling categories.

**Next Step:** In `2_logistics_engine.ipynb` we will build the machine learning model that predicts delivery times.