# Exploratory Data Analysis - Retail Sales Data Warehouse

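This notebook is a template for exploratory data analysis (EDA) of the retail sales data warehouse. Cells that need a live database connection are commented out; uncomment them once the connection in Section 2 is configured.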

## Objectives:
- Connect to the data warehouse
- Perform data quality checks
- Analyze sales trends and patterns
- Identify top products and customers
- Generate insights for business decisions

## 1. Import Required Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings

# Import custom modules
import sys
sys.path.append('..')
from src.utils.db_connection import DatabaseConnection

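# Configuration
# Note: this silences every warning, including pandas deprecation notices.
# Pass a category (e.g. category=FutureWarning) to keep the others visible.
warnings.filterwarnings('ignore')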
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries imported successfully")

## 2. Database Connection

In [None]:
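# Configure database connection
# The username and password below are placeholders: replace them before
# running, and avoid committing real credentials with the notebook.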
connection_params = {
    'db_type': 'postgresql',
    'host': 'localhost',
    'port': '5432',
    'database': 'retail_dw',
    'username': 'your_username',
    'password': 'your_password'
}

# Establish connection
# db = DatabaseConnection(connection_params)
# print("Connected to database successfully")
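
One way to keep credentials out of the notebook is to read them from environment variables. The sketch below assumes `DatabaseConnection` accepts the same dictionary keys as above. The `RETAIL_DW_*` variable names and the `load_connection_params` helper are hypothetical and not part of the project.

```python
import os


def load_connection_params(env=os.environ):
    """Build the connection_params dict from environment variables.

    The RETAIL_DW_* names are hypothetical; rename them to match your
    deployment. Defaults mirror the placeholder values used above.
    """
    return {
        'db_type': env.get('RETAIL_DW_DB_TYPE', 'postgresql'),
        'host': env.get('RETAIL_DW_HOST', 'localhost'),
        'port': env.get('RETAIL_DW_PORT', '5432'),
        'database': env.get('RETAIL_DW_DATABASE', 'retail_dw'),
        'username': env.get('RETAIL_DW_USER', ''),
        'password': env.get('RETAIL_DW_PASSWORD', ''),
    }


connection_params = load_connection_params()
# db = DatabaseConnection(connection_params)  # same call as in the cell above
```

Passing a plain dictionary as `env` makes the helper easy to exercise without touching the real environment.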

## 3. Data Loading

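Load the last year of sales from the data warehouse, joined to the product and customer dimensions. The query cell is a template: its execution lines stay commented out until the connection from Section 2 is established.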

In [None]:
# Example: Load sales fact table
query_sales = """
SELECT 
    f.sale_id,
    f.sale_date,
    f.quantity,
    f.unit_price,
    f.total_amount,
    p.product_name,
    p.category,
    c.customer_name,
    c.customer_segment
FROM fact_sales f
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_customer c ON f.customer_key = c.customer_key
WHERE f.sale_date >= CURRENT_DATE - INTERVAL '1 year'
"""

# df_sales = db.execute_query(query_sales)
# print(f"Loaded {len(df_sales)} sales records")
# df_sales.head()
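
When no warehouse is available, a synthetic `df_sales` with the same columns as the query lets the later cells run once uncommented. This is only a sketch: the product names, prices, segment labels, and the `make_sample_sales` helper are invented for illustration, and `total_amount` is assumed to equal `quantity * unit_price`.

```python
import numpy as np
import pandas as pd


def make_sample_sales(n_rows=500, seed=42):
    """Return a synthetic frame shaped like the query_sales result."""
    rng = np.random.default_rng(seed)

    # Hypothetical dimension data; values are illustrative only.
    products = pd.DataFrame({
        'product_name': ['Laptop', 'Headphones', 'Desk Chair', 'Notebook', 'Coffee Maker'],
        'category': ['Electronics', 'Electronics', 'Furniture', 'Stationery', 'Appliances'],
        'unit_price': [899.00, 59.90, 149.50, 3.25, 79.00],
    })
    customers = pd.DataFrame({
        'customer_name': [f'Customer {i:03d}' for i in range(1, 41)],
        'customer_segment': rng.choice(['Consumer', 'Corporate', 'Small Business'], size=40),
    })

    # Random sale dates within the last 365 days, mirroring the WHERE clause.
    end = pd.Timestamp.today().normalize()
    dates = end - pd.to_timedelta(rng.integers(0, 365, size=n_rows), unit='D')

    facts = pd.DataFrame({
        'sale_id': np.arange(1, n_rows + 1),
        'sale_date': dates,
        'quantity': rng.integers(1, 6, size=n_rows),
    })
    # Attach a random product and customer to every sale.
    prod = products.iloc[rng.integers(0, len(products), size=n_rows)].reset_index(drop=True)
    cust = customers.iloc[rng.integers(0, len(customers), size=n_rows)].reset_index(drop=True)
    df = pd.concat([facts, prod, cust], axis=1)
    df['total_amount'] = (df['quantity'] * df['unit_price']).round(2)

    columns = ['sale_id', 'sale_date', 'quantity', 'unit_price', 'total_amount',
               'product_name', 'category', 'customer_name', 'customer_segment']
    return df[columns].sort_values('sale_date').reset_index(drop=True)


df_sales = make_sample_sales()
print(f"Generated {len(df_sales)} synthetic sales records")
```

Because the data is random, any pattern in the resulting charts is an artifact of the generator and says nothing about the real business.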

## 4. Data Quality Checks

In [None]:
# Check for missing values
# print("Missing Values:")
# print(df_sales.isnull().sum())
# print("\nData Types:")
# print(df_sales.dtypes)
# print("\nBasic Statistics:")
# print(df_sales.describe())
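
Missing values and dtypes do not catch every problem in a fact table. The sketch below adds a few rule-based checks. It assumes that `sale_id` should be unique, that quantities should be positive, and that `total_amount` should equal `quantity * unit_price`. Adjust these rules if the warehouse records returns, discounts, or taxes. `quality_report` is a hypothetical helper name.

```python
import pandas as pd


def quality_report(df, tolerance=0.01):
    """Count rows that violate simple consistency rules (assumed, not given)."""
    expected_amount = df['quantity'] * df['unit_price']
    return pd.Series({
        'rows': len(df),
        'duplicate_sale_ids': int(df['sale_id'].duplicated().sum()),
        'non_positive_quantity': int((df['quantity'] <= 0).sum()),
        'negative_unit_price': int((df['unit_price'] < 0).sum()),
        'amount_mismatch': int(((df['total_amount'] - expected_amount).abs() > tolerance).sum()),
    })


# Small hand-made frame with one violation of each rule.
demo = pd.DataFrame({
    'sale_id': [1, 2, 2, 4],
    'quantity': [2, 0, 1, 3],
    'unit_price': [10.0, 5.0, -1.0, 4.0],
    'total_amount': [20.0, 0.0, -1.0, 15.0],
})
print(quality_report(demo))
```

On the real data the call would be `quality_report(df_sales)`; non-zero counts are prompts for investigation rather than proof of bad data.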

## 5. Sales Trend Analysis

In [None]:
# Daily sales trend
# df_sales['sale_date'] = pd.to_datetime(df_sales['sale_date'])
# daily_sales = df_sales.groupby('sale_date')['total_amount'].sum().reset_index()

# plt.figure(figsize=(14, 6))
# plt.plot(daily_sales['sale_date'], daily_sales['total_amount'], linewidth=2)
# plt.title('Daily Sales Trend', fontsize=16, fontweight='bold')
# plt.xlabel('Date', fontsize=12)
# plt.ylabel('Total Sales ($)', fontsize=12)
# plt.xticks(rotation=45)
# plt.tight_layout()
# plt.show()
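
Grouping by `sale_date` only yields the days that have at least one sale, so a line chart silently connects across the gaps. The sketch below fills missing days with zero and adds a rolling mean to smooth daily noise. The 7-day default window and the `daily_sales_with_rolling` helper are illustrative choices.

```python
import pandas as pd


def daily_sales_with_rolling(df, window=7):
    """Daily revenue on a gap-free calendar plus a trailing rolling mean."""
    dates = pd.to_datetime(df['sale_date'])
    daily = df.groupby(dates.dt.normalize())['total_amount'].sum()

    # Reindex onto every calendar day so days without sales show up as 0.
    full_range = pd.date_range(daily.index.min(), daily.index.max(), freq='D')
    daily = daily.reindex(full_range, fill_value=0.0)

    out = daily.rename('total_amount').to_frame()
    out.index.name = 'sale_date'
    # min_periods=1 keeps the first rows instead of returning NaN.
    out['rolling_mean'] = out['total_amount'].rolling(window, min_periods=1).mean()
    return out


demo = pd.DataFrame({
    'sale_date': ['2024-01-01', '2024-01-01', '2024-01-03'],
    'total_amount': [10.0, 5.0, 30.0],
})
print(daily_sales_with_rolling(demo, window=2))
```

Plotting the `rolling_mean` column next to `total_amount` usually makes the underlying trend easier to read than the raw daily series alone.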

## 6. Product Analysis

In [None]:
# Top 10 products by revenue
# top_products = df_sales.groupby('product_name')['total_amount'].sum().sort_values(ascending=False).head(10)

# plt.figure(figsize=(12, 6))
# top_products.plot(kind='barh', color='steelblue')
# plt.title('Top 10 Products by Revenue', fontsize=16, fontweight='bold')
# plt.xlabel('Total Revenue ($)', fontsize=12)
# plt.ylabel('Product', fontsize=12)
# plt.tight_layout()
# plt.show()
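
A top-10 chart shows which products lead but not how concentrated revenue is. The sketch below computes each product's share of revenue and the cumulative share, as in a Pareto analysis. `revenue_concentration` is a hypothetical helper, and the demo values are invented.

```python
import pandas as pd


def revenue_concentration(df, by='product_name'):
    """Revenue, share, and cumulative share per group, largest first."""
    revenue = df.groupby(by)['total_amount'].sum().sort_values(ascending=False)
    out = revenue.rename('revenue').to_frame()
    out['share'] = out['revenue'] / out['revenue'].sum()
    out['cumulative_share'] = out['share'].cumsum()
    return out


demo = pd.DataFrame({
    'product_name': ['A', 'A', 'B', 'C'],
    'total_amount': [40.0, 20.0, 30.0, 10.0],
})
print(revenue_concentration(demo))
```

The same function works for categories by passing `by='category'`.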

## 7. Customer Segmentation Analysis

In [None]:
# Sales by customer segment
# segment_sales = df_sales.groupby('customer_segment')['total_amount'].sum().sort_values(ascending=False)

# plt.figure(figsize=(10, 6))
# plt.pie(segment_sales, labels=segment_sales.index, autopct='%1.1f%%', startangle=90)
# plt.title('Sales Distribution by Customer Segment', fontsize=16, fontweight='bold')
# plt.axis('equal')
# plt.show()
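
A pie chart shows each segment's share of revenue but hides how many customers and transactions stand behind it. The sketch below tabulates those figures per segment. It counts customers by `customer_name`, which assumes names are unique, and `segment_summary` is a hypothetical helper.

```python
import pandas as pd


def segment_summary(df):
    """Revenue, transaction count, and customer count per customer segment."""
    out = df.groupby('customer_segment').agg(
        revenue=('total_amount', 'sum'),
        transactions=('sale_id', 'count'),
        customers=('customer_name', 'nunique'),
    )
    out['revenue_per_customer'] = out['revenue'] / out['customers']
    return out.sort_values('revenue', ascending=False)


demo = pd.DataFrame({
    'sale_id': [1, 2, 3, 4],
    'customer_name': ['Ann', 'Ann', 'Bob', 'Cy'],
    'customer_segment': ['Consumer', 'Consumer', 'Consumer', 'Corporate'],
    'total_amount': [10.0, 20.0, 30.0, 100.0],
})
print(segment_summary(demo))
```

A segment with high revenue but few customers depends on a small number of accounts, which the pie chart alone would not reveal.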

## 8. Category Performance

In [None]:
# Sales by category
# category_sales = df_sales.groupby('category').agg({
#     'total_amount': 'sum',
#     'quantity': 'sum',
#     'sale_id': 'count'
# }).rename(columns={'sale_id': 'transaction_count'})

# print("Category Performance:")
# print(category_sales.sort_values('total_amount', ascending=False))

## 9. Correlation Analysis

In [None]:
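# Correlation matrix
# Note: if total_amount is derived as quantity * unit_price, its correlation
# with both inputs is expected by construction; the informative cell is
# quantity vs. unit_price.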
# numeric_cols = ['quantity', 'unit_price', 'total_amount']
# correlation_matrix = df_sales[numeric_cols].corr()

# plt.figure(figsize=(8, 6))
# sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
#             square=True, linewidths=1, cbar_kws={"shrink": 0.8})
# plt.title('Correlation Matrix', fontsize=16, fontweight='bold')
# plt.tight_layout()
# plt.show()

## 10. Key Insights and Recommendations

Based on the exploratory analysis, document key findings:

1. **Sales Trends**: [Document trends observed]
2. **Top Performers**: [List top products/categories]
3. **Customer Behavior**: [Insights from segmentation]
4. **Opportunities**: [Areas for improvement]
5. **Recommendations**: [Actionable recommendations]

In [None]:
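# Generate summary report
# Note: each row is one fact_sales record, so 'Average Order Value' is the mean
# amount per sales record. Counting distinct names undercounts if two customers
# or two products share a name; add customer_key / product_key to the query
# when exact counts matter.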
# summary = {
#     'Total Sales': df_sales['total_amount'].sum(),
#     'Total Transactions': len(df_sales),
#     'Average Order Value': df_sales['total_amount'].mean(),
#     'Unique Customers': df_sales['customer_name'].nunique(),
#     'Unique Products': df_sales['product_name'].nunique()
# }

# print("\n" + "="*50)
# print("EXECUTIVE SUMMARY")
# print("="*50)
# for key, value in summary.items():
#     print(f"{key}: {value:,.2f}" if isinstance(value, float) else f"{key}: {value:,}")

## Conclusion

This exploratory analysis provides a foundation for deeper investigations and business intelligence reporting.