# Urban Pulse - Data Exploration

## Initial Data Exploration and Understanding

This notebook performs the initial exploration of the traffic volume dataset to understand its structure, quality, and basic characteristics.


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import os

# Add src to path (assumes the notebook runs from a folder next to src/, such as notebooks/)
sys.path.append(os.path.join(os.path.dirname(os.getcwd()), 'src'))

from data_processing import load_data, inspect_data

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
plt.style.use('seaborn-v0_8')  # this style name requires matplotlib >= 3.6
sns.set_palette("husl")

print("✓ Libraries imported successfully")


## 1. Load the Dataset

First, we load the raw traffic volume dataset from the data/raw directory.


In [None]:
# Load the dataset
# NOTE: Update this path to match your actual data file location
data_path = '../data/raw/Metro_Interstate_Traffic_Volume.csv'

# Alternative: If using a different dataset name, update accordingly
# data_path = '../data/raw/traffic_volume.csv'

try:
    df_raw = load_data(data_path)
    print(f"\nDataset loaded: {df_raw.shape[0]} rows × {df_raw.shape[1]} columns")
except FileNotFoundError:
    print("⚠️  Data file not found!")
    print("Please download the Metro Interstate Traffic Volume dataset from:")
    print("  - Kaggle: https://www.kaggle.com/datasets")
    print("  - UCI ML Repository: https://archive.ics.uci.edu/ml/index.php")
    print("\nPlace the CSV file in: data/raw/Metro_Interstate_Traffic_Volume.csv")
    raise  # Stop here: every later cell depends on df_raw


## 2. Initial Data Inspection

Let's examine the structure and basic information about the dataset.


In [None]:
# Display first few rows
print("First 5 rows:")
print(df_raw.head())

print("\n" + "="*60)
print("Dataset Info:")
print("="*60)
df_raw.info()


In [None]:
# Generate comprehensive data quality report
data_quality_report = inspect_data(df_raw)
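
The implementation of `inspect_data` lives in `src/data_processing.py` and is not shown in this notebook. As a rough idea of what a per-column quality report can contain, here is a minimal sketch. The function name `quality_report` and the toy values are hypothetical, and the real helper may report different fields.

```python
import pandas as pd


def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for inspect_data: one summary row per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(2),
        "n_unique": df.nunique(),  # NaN is not counted as a value
    })


# Toy frame with made-up values, for illustration only
toy = pd.DataFrame({
    "temp": [288.3, None, 289.6, 290.1],
    "weather_main": ["Clouds", "Clear", None, "Clear"],
    "traffic_volume": [5545, 4516, 4767, 5026],
})

report = quality_report(toy)
print(report)
```

Returning a DataFrame rather than only printing keeps the report available for later filtering, for example `report[report["missing"] > 0]`.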


In [None]:
# Display basic statistics for numeric columns
print("="*60)
print("DESCRIPTIVE STATISTICS (Numeric Columns)")
print("="*60)
print(df_raw.describe())


In [None]:
# Display value counts for categorical columns
print("="*60)
print("CATEGORICAL COLUMNS ANALYSIS")
print("="*60)

categorical_cols = df_raw.select_dtypes(include=['object']).columns.tolist()
for col in categorical_cols:
    print(f"\n{col}:")
    print(df_raw[col].value_counts().head(10))
    print(f"Unique values: {df_raw[col].nunique()}")


## 3. Data Types and Column Information

List each column with its data type and non-null count.


In [None]:
# Column names and data types
print("Column Information:")
print("="*60)
for i, (col, dtype) in enumerate(zip(df_raw.columns, df_raw.dtypes), 1):
    print(f"{i:2d}. {col:25s} : {str(dtype):15s} | Non-null: {df_raw[col].notna().sum():6d}")
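
A timestamp column read from CSV usually arrives as plain strings, so it is worth checking that it parses cleanly before relying on it. The sketch below assumes a column named `date_time` in `YYYY-MM-DD HH:MM:SS` format; both the name and the format are assumptions to verify against the listing above, and the toy rows are made up.

```python
import pandas as pd

# Toy rows with made-up values; the last timestamp is deliberately invalid
toy = pd.DataFrame({
    "date_time": ["2012-10-02 09:00:00", "2012-10-02 10:00:00", "not a date"],
    "traffic_volume": [5545, 4516, 4767],
})

# errors="coerce" turns unparseable strings into NaT instead of raising
parsed = pd.to_datetime(toy["date_time"], format="%Y-%m-%d %H:%M:%S", errors="coerce")

n_unparseable = int(parsed.isna().sum())
print(f"Unparseable timestamps: {n_unparseable}")
print(f"Time range: {parsed.min()} to {parsed.max()}")

# Once parsed, calendar parts are available through the .dt accessor
hours = parsed.dt.hour
```

Counting the `NaT` values first shows how many rows would be lost or need repair before any time-based feature is built.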


## 4. Check for Duplicates and Data Quality Issues


In [None]:
# Check for duplicate rows
duplicate_count = df_raw.duplicated().sum()
print(f"Duplicate rows: {duplicate_count}")

if duplicate_count > 0:
    print("\nSample duplicate rows:")
    print(df_raw[df_raw.duplicated(keep=False)].head(10))
else:
    print("✓ No duplicate rows found")
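
`duplicated()` with no arguments only flags rows that match in every column. Rows that share a key but differ elsewhere are missed, so a second check on the key alone can be useful. The sketch below assumes `date_time` is the natural key, which is an assumption about this dataset; the toy rows are made up.

```python
import pandas as pd

# Toy rows with made-up values: two rows share a timestamp but differ in weather
toy = pd.DataFrame({
    "date_time": ["2012-10-02 09:00:00", "2012-10-02 09:00:00", "2012-10-02 10:00:00"],
    "weather_main": ["Clouds", "Rain", "Clear"],
    "traffic_volume": [5545, 5545, 4516],
})

# Exact duplicates: every column must match
full_dupes = int(toy.duplicated().sum())

# Key duplicates: only the timestamp must match
key_dupes = int(toy.duplicated(subset=["date_time"]).sum())

print(f"Exact duplicate rows: {full_dupes}")
print(f"Rows repeating an earlier date_time: {key_dupes}")

# Missing values per column, as a companion quality check
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

How to resolve key duplicates (keep the first row, aggregate, or keep all) is a preprocessing decision and is left to the next notebook.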


## 5. Initial Visualizations

Quick visualizations to understand the data distribution.


In [None]:
# Quick histogram of traffic volume (if column exists)
if 'traffic_volume' in df_raw.columns:
    fig, ax = plt.subplots(figsize=(10, 6))
    df_raw['traffic_volume'].hist(bins=50, ax=ax, edgecolor='black', alpha=0.7)
    ax.set_xlabel('Traffic Volume', fontsize=12)
    ax.set_ylabel('Frequency', fontsize=12)
    ax.set_title('Initial Traffic Volume Distribution', fontsize=14, fontweight='bold')
    ax.axvline(df_raw['traffic_volume'].mean(), color='red', linestyle='--',
               label=f'Mean: {df_raw["traffic_volume"].mean():.0f}')
    ax.legend()
    plt.tight_layout()
    plt.show()
else:
    print("⚠️  'traffic_volume' column not found. Please check column names.")
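
A histogram is easier to read alongside a few numbers. As a small sketch, the mean can be compared with the median and quartiles: a mean far above the median points to a long right tail. The values below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up values with one large outlier, for illustration only
volume = pd.Series([100, 200, 300, 400, 5000], name="traffic_volume")

mean = volume.mean()
median = volume.median()
quartiles = volume.quantile([0.25, 0.5, 0.75])

print(f"Mean:   {mean:.0f}")
print(f"Median: {median:.0f}")
print(quartiles)
```

On the real data the same three calls can be applied to `df_raw['traffic_volume']`.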


## 6. Summary and Next Steps

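**What this notebook covered:**
- Dataset shape and structure (Sections 1 and 2)
- Missing values, as reported by the data quality report (Section 2)
- Data types for each column (Section 3)
- Duplicate rows (Section 4)
- Distribution of `traffic_volume` (Section 5)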

**Next Steps:**
- Proceed to `02_data_preprocessing.ipynb` for data cleaning and feature engineering
