# Phase 2: Exploratory Data Analysis (EDA)

This notebook performs comprehensive exploratory data analysis on the cleaned Used Cars Dataset.

**Objectives:**
- Univariate analysis of target and features
- Bivariate analysis (feature-target relationships)
- Multivariate analysis (feature interactions)
- Geographic analysis (price by location)
- Temporal analysis (price trends over time)
- Identify key insights for modeling

## 1. Setup & Configuration

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from scipy import stats
from scipy.stats import pearsonr, spearmanr
import os

# Configuration
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', '{:.2f}'.format)

# Set random seed
np.random.seed(42)

# Set plot style and color palette
sns.set_style('whitegrid')
COLORS = {
    'primary': '#2E86AB',
    'secondary': '#A23B72',
    'accent': '#F18F01',
    'error': '#C73E1D',
    'success': '#6A994E'
}

print("Libraries imported successfully!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Define project paths
PROJECT_ROOT = '/content/drive/MyDrive/used_car_price_prediction'
DATA_PROCESSED = os.path.join(PROJECT_ROOT, 'data/processed')
RESULTS_FIGURES = os.path.join(PROJECT_ROOT, 'results/figures/eda_plots')

# Create directories if they don't exist
os.makedirs(RESULTS_FIGURES, exist_ok=True)

print(f"Project root: {PROJECT_ROOT}")
print(f"Processed data path: {DATA_PROCESSED}")
print(f"Figures path: {RESULTS_FIGURES}")

## 2. Load Cleaned Dataset

In [None]:
# Load cleaned dataset from Phase 1
file_path = os.path.join(DATA_PROCESSED, 'vehicles_cleaned.csv')
df = pd.read_csv(file_path)

print("Dataset loaded successfully!")
print(f"Shape: {df.shape}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

In [None]:
# Convert categorical columns back to category dtype
categorical_features = ['region', 'manufacturer', 'model', 'condition', 'cylinders', 
                       'fuel', 'title_status', 'transmission', 'drive', 
                       'type', 'paint_color', 'state']

for feature in categorical_features:
    if feature in df.columns:
        df[feature] = df[feature].astype('category')

# Convert posting_date to datetime
# utc=True is an assumption: it normalizes timestamps that carry mixed UTC offsets
if 'posting_date' in df.columns:
    df['posting_date'] = pd.to_datetime(df['posting_date'], utc=True)

print("Data types optimized!")

In [None]:
# Display first few rows
df.head(10)

In [None]:
# Dataset information
df.info()

## 3. Univariate Analysis

Analyzing the distribution of individual features.

### 3.1 Target Variable: Price

In [None]:
# Price statistics
print("Price Distribution Statistics:")
print("=" * 60)
print(df['price'].describe())
print(f"\nSkewness: {df['price'].skew():.3f}")
print(f"Kurtosis: {df['price'].kurtosis():.3f}")
print("=" * 60)

In [None]:
# Price distribution visualizations
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Histogram
axes[0, 0].hist(df['price'], bins=100, color=COLORS['primary'], edgecolor='black', alpha=0.7)
axes[0, 0].axvline(df['price'].mean(), color=COLORS['error'], linestyle='--', linewidth=2, label=f'Mean: ${df["price"].mean():,.0f}')
axes[0, 0].axvline(df['price'].median(), color=COLORS['success'], linestyle='--', linewidth=2, label=f'Median: ${df["price"].median():,.0f}')
axes[0, 0].set_xlabel('Price ($)', fontsize=12)
axes[0, 0].set_ylabel('Frequency', fontsize=12)
axes[0, 0].set_title('Price Distribution', fontsize=14, fontweight='bold')
axes[0, 0].legend()
axes[0, 0].grid(axis='y', alpha=0.3)

# Box plot
axes[0, 1].boxplot(df['price'], vert=True, patch_artist=True,
                   boxprops=dict(facecolor=COLORS['primary'], alpha=0.7),
                   medianprops=dict(color='black', linewidth=2))
axes[0, 1].set_ylabel('Price ($)', fontsize=12)
axes[0, 1].set_title('Price Box Plot', fontsize=14, fontweight='bold')
axes[0, 1].grid(axis='y', alpha=0.3)

# Log-transformed histogram (non-positive prices excluded, since log10 is undefined there)
axes[1, 0].hist(np.log10(df.loc[df['price'] > 0, 'price']), bins=100, color=COLORS['secondary'], edgecolor='black', alpha=0.7)
axes[1, 0].set_xlabel('Log10(Price)', fontsize=12)
axes[1, 0].set_ylabel('Frequency', fontsize=12)
axes[1, 0].set_title('Price Distribution (Log Scale)', fontsize=14, fontweight='bold')
axes[1, 0].grid(axis='y', alpha=0.3)

# Q-Q plot (visual normality check, not a formal test)
stats.probplot(df['price'], dist="norm", plot=axes[1, 1])
axes[1, 1].get_lines()[0].set_color(COLORS['primary'])
axes[1, 1].get_lines()[0].set_markersize(3)
axes[1, 1].get_lines()[1].set_color(COLORS['error'])
axes[1, 1].set_title('Q-Q Plot (Price vs Normal Distribution)', fontsize=14, fontweight='bold')
axes[1, 1].grid(alpha=0.3)

plt.tight_layout()
plt.savefig(os.path.join(RESULTS_FIGURES, 'eda_01_price_distribution.png'), dpi=300, bbox_inches='tight')
plt.show()

print("Figure saved: eda_01_price_distribution.png")
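
The log-scale histogram above can be backed by a quick numeric check: compare the skewness of the price before and after a log transform. The sketch below is illustrative only. It uses synthetic lognormal values as a stand-in for `df['price']`, and the names `prices` and `log_prices` are assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df['price']: lognormal values are right-skewed,
# much like typical listing prices. Swap in df['price'] inside the notebook.
rng = np.random.default_rng(42)
prices = pd.Series(rng.lognormal(mean=9.5, sigma=0.8, size=10_000), name='price')

# log1p(x) = log(1 + x) stays finite even if a price of 0 slips through
log_prices = np.log1p(prices)

raw_skew = prices.skew()
log_skew = log_prices.skew()

print(f"Skewness (raw):   {raw_skew:.3f}")
print(f"Skewness (log1p): {log_skew:.3f}")
```

A skewness near zero after the transform would support modeling log-price for linear models; predictions then need `np.expm1` to return to dollars.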

### 3.2 Numerical Features

In [None]:
# Numerical features statistics
numerical_features = ['year', 'odometer', 'lat', 'long']

print("Numerical Features Statistics:")
print("=" * 80)
print(df[numerical_features].describe())
print("=" * 80)

In [None]:
# Distribution of numerical features
fig, axes = plt.subplots(2, 4, figsize=(20, 10))

for idx, feature in enumerate(numerical_features):
    row = idx // 2
    col_hist = (idx % 2) * 2
    col_box = col_hist + 1
    
    # Histogram
    axes[row, col_hist].hist(df[feature], bins=50, color=COLORS['primary'], 
                             edgecolor='black', alpha=0.7)
    axes[row, col_hist].axvline(df[feature].mean(), color=COLORS['error'], 
                                linestyle='--', linewidth=2, label='Mean')
    axes[row, col_hist].axvline(df[feature].median(), color=COLORS['success'], 
                                linestyle='--', linewidth=2, label='Median')
    axes[row, col_hist].set_xlabel(feature.title(), fontsize=11)
    axes[row, col_hist].set_ylabel('Frequency', fontsize=11)
    axes[row, col_hist].set_title(f'{feature.title()} Distribution', fontsize=12, fontweight='bold')
    axes[row, col_hist].legend(fontsize=9)
    axes[row, col_hist].grid(axis='y', alpha=0.3)
    
    # Box plot
    axes[row, col_box].boxplot(df[feature], vert=True, patch_artist=True,
                               boxprops=dict(facecolor=COLORS['primary'], alpha=0.7),
                               medianprops=dict(color='black', linewidth=2))
    axes[row, col_box].set_ylabel(feature.title(), fontsize=11)
    axes[row, col_box].set_title(f'{feature.title()} Box Plot', fontsize=12, fontweight='bold')
    axes[row, col_box].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.savefig(os.path.join(RESULTS_FIGURES, 'eda_02_numerical_distributions.png'), dpi=300, bbox_inches='tight')
plt.show()

print("Figure saved: eda_02_numerical_distributions.png")

### 3.3 Categorical Features

In [None]:
# Categorical features summary
print("Categorical Features Summary:")
print("=" * 80)

cat_summary = []
for feature in categorical_features:
    if feature in df.columns:
        n_unique = df[feature].nunique()
        most_common = df[feature].mode()[0]
        most_common_freq = (df[feature] == most_common).sum()
        most_common_pct = (most_common_freq / len(df)) * 100
        
        cat_summary.append({
            'Feature': feature,
            'Unique_Values': n_unique,
            'Most_Common': most_common,
            'Frequency': most_common_freq,
            'Percentage': f'{most_common_pct:.2f}%'
        })

cat_summary_df = pd.DataFrame(cat_summary)
print(cat_summary_df.to_string(index=False))
print("=" * 80)

In [None]:
# Visualize low-cardinality categorical features
low_card_features = ['fuel', 'transmission', 'drive', 'type', 'condition', 'title_status']
low_card_features = [f for f in low_card_features if f in df.columns]

n_features = len(low_card_features)
n_cols = 3
n_rows = (n_features + n_cols - 1) // n_cols

fig, axes = plt.subplots(n_rows, n_cols, figsize=(18, 5 * n_rows))
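# plt.subplots returns an array whenever the grid has more than one cell,
# so normalize to a flat 1-D array regardless of the number of features
axes = np.atleast_1d(axes).ravel()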

for idx, feature in enumerate(low_card_features):
    value_counts = df[feature].value_counts().head(10)
    
    axes[idx].barh(range(len(value_counts)), value_counts.values, 
                   color=COLORS['primary'], alpha=0.7)
    axes[idx].set_yticks(range(len(value_counts)))
    axes[idx].set_yticklabels(value_counts.index, fontsize=10)
    axes[idx].set_xlabel('Count', fontsize=11)
    axes[idx].set_title(f'{feature.title()} Distribution', fontsize=12, fontweight='bold')
    axes[idx].grid(axis='x', alpha=0.3)
    axes[idx].invert_yaxis()
    
    # Add percentage labels
    for i, v in enumerate(value_counts.values):
        pct = (v / len(df)) * 100
        axes[idx].text(v, i, f'  {v:,} ({pct:.1f}%)', va='center', fontsize=9)

# Hide unused subplots
for idx in range(n_features, len(axes)):
    axes[idx].axis('off')

plt.tight_layout()
plt.savefig(os.path.join(RESULTS_FIGURES, 'eda_03_categorical_distributions.png'), dpi=300, bbox_inches='tight')
plt.show()

print("Figure saved: eda_03_categorical_distributions.png")

In [None]:
# Visualize high-cardinality categorical features (top 15)
high_card_features = ['manufacturer', 'state', 'paint_color']
high_card_features = [f for f in high_card_features if f in df.columns]

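# Size the grid to the features actually present in the dataset
fig, axes = plt.subplots(1, len(high_card_features), figsize=(18, 6))
axes = np.atleast_1d(axes).ravel()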

for idx, feature in enumerate(high_card_features):
    value_counts = df[feature].value_counts().head(15)
    
    axes[idx].barh(range(len(value_counts)), value_counts.values, 
                   color=COLORS['secondary'], alpha=0.7)
    axes[idx].set_yticks(range(len(value_counts)))
    axes[idx].set_yticklabels(value_counts.index, fontsize=9)
    axes[idx].set_xlabel('Count', fontsize=11)
    axes[idx].set_title(f'{feature.title()} - Top 15', fontsize=12, fontweight='bold')
    axes[idx].grid(axis='x', alpha=0.3)
    axes[idx].invert_yaxis()

plt.tight_layout()
plt.savefig(os.path.join(RESULTS_FIGURES, 'eda_04_high_cardinality_features.png'), dpi=300, bbox_inches='tight')
plt.show()

print("Figure saved: eda_04_high_cardinality_features.png")

## 4. Bivariate Analysis

Analyzing relationships between features and the target variable (price).

### 4.1 Numerical Features vs. Price

In [None]:
# Correlation analysis
numerical_cols = ['price', 'year', 'odometer', 'lat', 'long']
correlation_matrix = df[numerical_cols].corr()

print("Correlation with Price:")
print("=" * 40)
price_corr = correlation_matrix['price'].sort_values(ascending=False)
for feature, corr in price_corr.items():
    if feature != 'price':
        print(f"{feature:15s}: {corr:6.3f}")
print("=" * 40)
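
`DataFrame.corr()` defaults to the Pearson coefficient, which captures linear association only. Price tends to fall non-linearly with mileage, so a rank-based coefficient is a useful cross-check. The sketch below uses the `pearsonr` and `spearmanr` functions imported in Setup on synthetic data; the exponential decay curve and all constants are assumptions chosen for illustration, not estimates from the dataset.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(42)

# Synthetic stand-in: price decays exponentially with mileage, plus noise.
odometer = rng.uniform(0, 200_000, size=5_000)
price = 30_000 * np.exp(-odometer / 60_000) * rng.lognormal(0, 0.1, size=5_000)

# Both functions return (coefficient, p-value)
pearson_r, pearson_p = pearsonr(odometer, price)
spearman_r, spearman_p = spearmanr(odometer, price)

print(f"Pearson r:    {pearson_r:.3f}")
print(f"Spearman rho: {spearman_r:.3f}")
```

When the Spearman coefficient is noticeably larger in magnitude than the Pearson one, the relationship is monotonic but curved, which hints that a transformed feature or a tree-based model may fit better than a straight line.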

In [None]:
# Scatter plots: Numerical features vs. Price
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Sample data for performance (max 10000 points per plot)
sample_size = min(10000, len(df))
df_sample = df.sample(sample_size, random_state=42)

# Year vs Price
axes[0, 0].scatter(df_sample['year'], df_sample['price'], alpha=0.3, s=1, color=COLORS['primary'])
axes[0, 0].set_xlabel('Year', fontsize=12)
axes[0, 0].set_ylabel('Price ($)', fontsize=12)
axes[0, 0].set_title(f'Year vs Price (correlation: {price_corr["year"]:.3f})', 
                     fontsize=13, fontweight='bold')
axes[0, 0].grid(alpha=0.3)

# Odometer vs Price
axes[0, 1].scatter(df_sample['odometer'], df_sample['price'], alpha=0.3, s=1, color=COLORS['secondary'])
axes[0, 1].set_xlabel('Odometer (miles)', fontsize=12)
axes[0, 1].set_ylabel('Price ($)', fontsize=12)
axes[0, 1].set_title(f'Odometer vs Price (correlation: {price_corr["odometer"]:.3f})', 
                     fontsize=13, fontweight='bold')
axes[0, 1].grid(alpha=0.3)

# Latitude vs Price
axes[1, 0].scatter(df_sample['lat'], df_sample['price'], alpha=0.3, s=1, color=COLORS['accent'])
axes[1, 0].set_xlabel('Latitude', fontsize=12)
axes[1, 0].set_ylabel('Price ($)', fontsize=12)
axes[1, 0].set_title(f'Latitude vs Price (correlation: {price_corr["lat"]:.3f})', 
                     fontsize=13, fontweight='bold')
axes[1, 0].grid(alpha=0.3)

# Longitude vs Price
axes[1, 1].scatter(df_sample['long'], df_sample['price'], alpha=0.3, s=1, color=COLORS['error'])
axes[1, 1].set_xlabel('Longitude', fontsize=12)
axes[1, 1].set_ylabel('Price ($)', fontsize=12)
axes[1, 1].set_title(f'Longitude vs Price (correlation: {price_corr["long"]:.3f})', 
                     fontsize=13, fontweight='bold')
axes[1, 1].grid(alpha=0.3)

plt.tight_layout()
plt.savefig(os.path.join(RESULTS_FIGURES, 'eda_05_numerical_vs_price.png'), dpi=300, bbox_inches='tight')
plt.show()

print("Figure saved: eda_05_numerical_vs_price.png")

### 4.2 Categorical Features vs. Price

In [None]:
# Box plots: Key categorical features vs. Price
key_cat_features = ['manufacturer', 'condition', 'fuel', 'transmission', 'drive', 'type']
key_cat_features = [f for f in key_cat_features if f in df.columns]

n_features = len(key_cat_features)
n_cols = 2
n_rows = (n_features + n_cols - 1) // n_cols

fig, axes = plt.subplots(n_rows, n_cols, figsize=(16, 5 * n_rows))
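# plt.subplots returns an array whenever the grid has more than one cell,
# so normalize to a flat 1-D array regardless of the number of features
axes = np.atleast_1d(axes).ravel()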

for idx, feature in enumerate(key_cat_features):
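    # Get top categories by frequency
    top_categories = df[feature].value_counts().head(10).index
    df_subset = df[df[feature].isin(top_categories)].copy()
    # Drop categories outside the top 10 so the plot has no empty boxes
    df_subset[feature] = df_subset[feature].cat.remove_unused_categories()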
    
    # Create box plot
    df_subset.boxplot(column='price', by=feature, ax=axes[idx], patch_artist=True)
    axes[idx].set_xlabel(feature.title(), fontsize=11)
    axes[idx].set_ylabel('Price ($)', fontsize=11)
    axes[idx].set_title(f'Price by {feature.title()} (Top 10)', fontsize=12, fontweight='bold')
    axes[idx].tick_params(axis='x', rotation=45, labelsize=9)
    axes[idx].grid(alpha=0.3)
    plt.setp(axes[idx].xaxis.get_majorticklabels(), rotation=45, ha='right')

# Hide unused subplots
for idx in range(n_features, len(axes)):
    axes[idx].axis('off')

plt.suptitle('')  # Remove default title
plt.tight_layout()
plt.savefig(os.path.join(RESULTS_FIGURES, 'eda_06_categorical_vs_price.png'), dpi=300, bbox_inches='tight')
plt.show()

print("Figure saved: eda_06_categorical_vs_price.png")

In [None]:
# Average price by category
print("Average Price by Key Categories:")
print("=" * 80)

for feature in ['manufacturer', 'condition', 'fuel', 'transmission', 'type']:
    if feature in df.columns:
        # observed=True skips categories that have no rows
        avg_price = (
            df.groupby(feature, observed=True)['price']
            .agg(['mean', 'median', 'count'])
            .sort_values('mean', ascending=False)
        )
        print(f"\n{feature.upper()} - Top 10 by Average Price:")
        print(avg_price.head(10).to_string())
        print("-" * 80)
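
The per-category means printed above are noisy for categories with a small `count`, which matters if they are later used for target encoding. One common remedy is to shrink each category mean toward the global mean. The sketch below is a minimal illustration on made-up rows; the function name `smoothed_target_encode` and the smoothing weight `m` are hypothetical, not part of the notebook.

```python
import pandas as pd

# Toy listings (made-up values): 'rare_make' has a single, very expensive listing.
toy = pd.DataFrame({
    'manufacturer': ['ford'] * 4 + ['toyota'] * 3 + ['rare_make'],
    'price': [10_000, 12_000, 11_000, 13_000, 15_000, 14_000, 16_000, 90_000],
})

def smoothed_target_encode(frame, column, target, m=5.0):
    """Shrink each category mean toward the global mean.

    encoded = (n * category_mean + m * global_mean) / (n + m)
    where n is the category count and m (hypothetical default 5) sets
    the strength of the shrinkage.
    """
    global_mean = frame[target].mean()
    stats = frame.groupby(column, observed=True)[target].agg(['mean', 'count'])
    return (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)

encoding = smoothed_target_encode(toy, 'manufacturer', 'price')
print(encoding.round(2))
```

The single `rare_make` listing is pulled well below its raw mean of 90,000, while the better-populated categories move less. To avoid target leakage, an encoding like this should be fitted on training folds only.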

## 5. Multivariate Analysis

### 5.1 Correlation Matrix

In [None]:
# Correlation heatmap for numerical features
fig, ax = plt.subplots(figsize=(10, 8))

# Create correlation matrix
corr_matrix = df[numerical_cols].corr()

# Create heatmap
sns.heatmap(corr_matrix, annot=True, fmt='.3f', cmap='coolwarm', center=0,
            square=True, linewidths=1, cbar_kws={"shrink": 0.8}, ax=ax)
ax.set_title('Correlation Matrix - Numerical Features', fontsize=14, fontweight='bold', pad=20)

plt.tight_layout()
plt.savefig(os.path.join(RESULTS_FIGURES, 'eda_07_correlation_matrix.png'), dpi=300, bbox_inches='tight')
plt.show()

print("Figure saved: eda_07_correlation_matrix.png")

### 5.2 Feature Interactions

In [None]:
# Year and Odometer interaction with Price
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Create bins for year and odometer
df_sample = df.sample(min(10000, len(df)), random_state=42)

# Year bins
df_sample['year_bin'] = pd.cut(df_sample['year'], bins=5, labels=['Very Old', 'Old', 'Medium', 'Recent', 'New'])

# Odometer bins
df_sample['odometer_bin'] = pd.qcut(df_sample['odometer'], q=5, labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])

# Plot 1: Price by Year and Odometer bins
for odometer_cat in df_sample['odometer_bin'].cat.categories:
    subset = df_sample[df_sample['odometer_bin'] == odometer_cat]
    avg_prices = subset.groupby('year_bin')['price'].mean()
    axes[0].plot(avg_prices.index, avg_prices.values, marker='o', label=f'Odometer: {odometer_cat}', linewidth=2)

axes[0].set_xlabel('Year Category', fontsize=12)
axes[0].set_ylabel('Average Price ($)', fontsize=12)
axes[0].set_title('Price by Year and Odometer Categories', fontsize=13, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Plot 2: Price by Manufacturer and Condition
top_manufacturers = df['manufacturer'].value_counts().head(8).index
df_subset = df[df['manufacturer'].isin(top_manufacturers)]

condition_order = ['salvage', 'fair', 'good', 'excellent', 'like new', 'new']
available_conditions = [c for c in condition_order if c in df_subset['condition'].unique()]

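# observed=True keeps only the top manufacturers actually present in df_subset;
# missing manufacturer/condition pairs stay NaN instead of plotting as $0
price_by_mfr_cond = df_subset.groupby(['manufacturer', 'condition'], observed=True)['price'].mean().unstack()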
price_by_mfr_cond = price_by_mfr_cond[available_conditions] if len(available_conditions) > 0 else price_by_mfr_cond

price_by_mfr_cond.plot(kind='bar', ax=axes[1], width=0.8, colormap='viridis')
axes[1].set_xlabel('Manufacturer', fontsize=12)
axes[1].set_ylabel('Average Price ($)', fontsize=12)
axes[1].set_title('Price by Manufacturer and Condition', fontsize=13, fontweight='bold')
axes[1].legend(title='Condition', bbox_to_anchor=(1.05, 1), loc='upper left')
axes[1].tick_params(axis='x', rotation=45)
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.savefig(os.path.join(RESULTS_FIGURES, 'eda_08_feature_interactions.png'), dpi=300, bbox_inches='tight')
plt.show()

print("Figure saved: eda_08_feature_interactions.png")

## 6. Geographic Analysis

In [None]:
# Price by state
state_stats = df.groupby('state', observed=True)['price'].agg(['mean', 'median', 'count']).sort_values('mean', ascending=False)

print("Top 15 States by Average Price:")
print("=" * 60)
print(state_stats.head(15).to_string())
print("=" * 60)

In [None]:
# Geographic visualizations
fig, axes = plt.subplots(2, 2, figsize=(18, 14))

# Sample data for performance
sample_size = min(20000, len(df))
df_geo_sample = df.sample(sample_size, random_state=42)

# Plot 1: Geographic distribution of listings
axes[0, 0].scatter(df_geo_sample['long'], df_geo_sample['lat'], 
                   alpha=0.2, s=1, color=COLORS['primary'])
axes[0, 0].set_xlabel('Longitude', fontsize=12)
axes[0, 0].set_ylabel('Latitude', fontsize=12)
axes[0, 0].set_title(f'Geographic Distribution of Listings (n={sample_size:,})', 
                     fontsize=13, fontweight='bold')
axes[0, 0].grid(alpha=0.3)

# Plot 2: Listings colored by price (color scale clipped to the 5th-95th percentile)
scatter = axes[0, 1].scatter(df_geo_sample['long'], df_geo_sample['lat'], 
                             c=df_geo_sample['price'], cmap='YlOrRd', 
                             alpha=0.5, s=3, vmin=df['price'].quantile(0.05), 
                             vmax=df['price'].quantile(0.95))
axes[0, 1].set_xlabel('Longitude', fontsize=12)
axes[0, 1].set_ylabel('Latitude', fontsize=12)
axes[0, 1].set_title('Price by Geographic Location', fontsize=13, fontweight='bold')
plt.colorbar(scatter, ax=axes[0, 1], label='Price ($)')
axes[0, 1].grid(alpha=0.3)

# Plot 3: Average price by state (top 20)
top_20_states = state_stats.head(20)
axes[1, 0].barh(range(len(top_20_states)), top_20_states['mean'].values, 
                color=COLORS['secondary'], alpha=0.7)
axes[1, 0].set_yticks(range(len(top_20_states)))
axes[1, 0].set_yticklabels(top_20_states.index)
axes[1, 0].set_xlabel('Average Price ($)', fontsize=12)
axes[1, 0].set_title('Top 20 States by Average Price', fontsize=13, fontweight='bold')
axes[1, 0].invert_yaxis()
axes[1, 0].grid(axis='x', alpha=0.3)

# Plot 4: Number of listings by state (top 20)
top_20_states_count = df['state'].value_counts().head(20)
axes[1, 1].barh(range(len(top_20_states_count)), top_20_states_count.values, 
                color=COLORS['accent'], alpha=0.7)
axes[1, 1].set_yticks(range(len(top_20_states_count)))
axes[1, 1].set_yticklabels(top_20_states_count.index)
axes[1, 1].set_xlabel('Number of Listings', fontsize=12)
axes[1, 1].set_title('Top 20 States by Number of Listings', fontsize=13, fontweight='bold')
axes[1, 1].invert_yaxis()
axes[1, 1].grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.savefig(os.path.join(RESULTS_FIGURES, 'eda_09_geographic_analysis.png'), dpi=300, bbox_inches='tight')
plt.show()

print("Figure saved: eda_09_geographic_analysis.png")

## 7. Temporal Analysis

In [None]:
# Extract temporal features
df['posting_month'] = df['posting_date'].dt.month
df['posting_day_of_week'] = df['posting_date'].dt.dayofweek
df['posting_year_month'] = df['posting_date'].dt.to_period('M')

print("Temporal features extracted successfully!")

In [None]:
# Temporal analysis visualizations
fig, axes = plt.subplots(2, 2, figsize=(18, 12))

# Plot 1: Average price over time (monthly)
monthly_price = df.groupby('posting_year_month')['price'].agg(['mean', 'count'])
monthly_price.index = monthly_price.index.to_timestamp()

axes[0, 0].plot(monthly_price.index, monthly_price['mean'], 
                color=COLORS['primary'], linewidth=2, marker='o', markersize=4)
axes[0, 0].set_xlabel('Date', fontsize=12)
axes[0, 0].set_ylabel('Average Price ($)', fontsize=12)
axes[0, 0].set_title('Average Price Trend Over Time', fontsize=13, fontweight='bold')
axes[0, 0].grid(alpha=0.3)
axes[0, 0].tick_params(axis='x', rotation=45)

# Plot 2: Number of listings over time
axes[0, 1].bar(monthly_price.index, monthly_price['count'], 
               color=COLORS['secondary'], alpha=0.7, width=20)
axes[0, 1].set_xlabel('Date', fontsize=12)
axes[0, 1].set_ylabel('Number of Listings', fontsize=12)
axes[0, 1].set_title('Listing Volume Over Time', fontsize=13, fontweight='bold')
axes[0, 1].grid(axis='y', alpha=0.3)
axes[0, 1].tick_params(axis='x', rotation=45)

# Plot 3: Average price by month of year
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
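# Reindex to all 12 months so the bar chart still works when some months have no listings
monthly_avg = df.groupby('posting_month')['price'].mean().reindex(range(1, 13))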

axes[1, 0].bar(range(1, 13), monthly_avg.values, color=COLORS['accent'], alpha=0.7)
axes[1, 0].set_xticks(range(1, 13))
axes[1, 0].set_xticklabels(month_names)
axes[1, 0].set_xlabel('Month', fontsize=12)
axes[1, 0].set_ylabel('Average Price ($)', fontsize=12)
axes[1, 0].set_title('Average Price by Month of Year', fontsize=13, fontweight='bold')
axes[1, 0].grid(axis='y', alpha=0.3)

# Plot 4: Average price by day of week
day_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
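# Reindex to all 7 days so the bar chart still works when some days have no listings
daily_avg = df.groupby('posting_day_of_week')['price'].mean().reindex(range(7))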

axes[1, 1].bar(range(7), daily_avg.values, color=COLORS['success'], alpha=0.7)
axes[1, 1].set_xticks(range(7))
axes[1, 1].set_xticklabels(day_names)
axes[1, 1].set_xlabel('Day of Week', fontsize=12)
axes[1, 1].set_ylabel('Average Price ($)', fontsize=12)
axes[1, 1].set_title('Average Price by Day of Week', fontsize=13, fontweight='bold')
axes[1, 1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.savefig(os.path.join(RESULTS_FIGURES, 'eda_10_temporal_analysis.png'), dpi=300, bbox_inches='tight')
plt.show()

print("Figure saved: eda_10_temporal_analysis.png")

## 8. Key Insights Summary

In [None]:
# Generate comprehensive insights
print("=" * 80)
print("KEY INSIGHTS FROM EXPLORATORY DATA ANALYSIS")
print("=" * 80)

print("\n1. TARGET VARIABLE (PRICE):")
print("-" * 80)
print(f"   - Mean: ${df['price'].mean():,.2f}")
print(f"   - Median: ${df['price'].median():,.2f}")
print(f"   - Std Dev: ${df['price'].std():,.2f}")
print(f"   - Skewness: {df['price'].skew():.3f} (positive = right-skewed)")
print(f"   - Price range: ${df['price'].min():,.0f} - ${df['price'].max():,.0f}")

print("\n2. STRONGEST CORRELATIONS WITH PRICE:")
print("-" * 80)
price_correlations = df[numerical_cols].corr()['price'].sort_values(ascending=False)
for feature, corr in price_correlations.items():
    if feature != 'price':
        direction = 'positive' if corr > 0 else 'negative'
        strength = 'strong' if abs(corr) > 0.5 else 'moderate' if abs(corr) > 0.3 else 'weak'
        print(f"   - {feature}: {corr:.3f} ({strength} {direction})")

print("\n3. CATEGORICAL FEATURES IMPACT:")
print("-" * 80)
for feature in ['manufacturer', 'condition', 'fuel', 'type']:
    if feature in df.columns:
        top_cat = df.groupby(feature)['price'].mean().sort_values(ascending=False).head(3)
        print(f"   {feature.upper()} - Top 3 by avg price:")
        for cat, price in top_cat.items():
            print(f"      {cat}: ${price:,.2f}")

print("\n4. GEOGRAPHIC INSIGHTS:")
print("-" * 80)
top_states = state_stats.head(5)
print("   Top 5 states by average price:")
for state, row in top_states.iterrows():
    # iterrows() upcasts the row to float, so cast count back to int for display
    print(f"      {state}: ${row['mean']:,.2f} (n={int(row['count']):,})")

print("\n5. TEMPORAL PATTERNS:")
print("-" * 80)
highest_month = monthly_avg.idxmax()
lowest_month = monthly_avg.idxmin()
print(f"   - Highest avg price month: {month_names[highest_month-1]} (${monthly_avg[highest_month]:,.2f})")
print(f"   - Lowest avg price month: {month_names[lowest_month-1]} (${monthly_avg[lowest_month]:,.2f})")
print(f"   - Seasonal variation: ${monthly_avg.max() - monthly_avg.min():,.2f}")

print("\n6. DATA QUALITY:")
print("-" * 80)
print(f"   - Total records: {len(df):,}")
print(f"   - Total features: {len(df.columns)}")
print(f"   - Missing values: {df.isna().sum().sum():,}")
print(f"   - Duplicate records: {df.duplicated().sum():,}")

print("\n" + "=" * 80)

## 9. Recommendations for Modeling

In [None]:
print("=" * 80)
print("RECOMMENDATIONS FOR MODELING")
print("=" * 80)

print("\n1. FEATURE ENGINEERING OPPORTUNITIES:")
print("-" * 80)
print("   - Create vehicle_age feature (posting year - year)")
print("   - Create odometer_per_year feature (odometer / vehicle_age, guarding against vehicle_age = 0)")
print("   - Consider price range categories for stratification")
print("   - Extract temporal features from posting_date")
print("   - Consider regional groupings for geographic features")

print("\n2. ENCODING STRATEGIES:")
print("-" * 80)
print("   - Low cardinality (<15): One-hot encoding")
print("     Features: fuel, transmission, drive, condition, title_status, type, cylinders")
print("   - Medium cardinality (15-100): Target encoding or frequency encoding")
print("     Features: manufacturer, state, paint_color")
print("   - High cardinality (>100): Target encoding with regularization")
print("     Features: model, region")

print("\n3. FEATURE SCALING:")
print("-" * 80)
print("   - StandardScaler for Ridge Regression: year, odometer, lat, long")
print("   - No scaling needed for tree-based models (Random Forest, XGBoost, CatBoost)")

print("\n4. TARGET VARIABLE TRANSFORMATION:")
print("-" * 80)
print(f"   - Price skewness: {df['price'].skew():.3f} (positive = right-skewed)")
print("   - Consider log transformation for linear models")
print("   - Tree-based models can handle skewness naturally")

print("\n5. IMPORTANT FEATURES IDENTIFIED:")
print("-" * 80)
print("   Strong predictors:")
for feature, corr in price_correlations.items():
    if feature != 'price' and abs(corr) > 0.3:
        print(f"      - {feature} (correlation: {corr:.3f})")
print("   Key categorical features:")
print("      - manufacturer, condition, type, fuel")

print("\n6. CROSS-VALIDATION STRATEGY:")
print("-" * 80)
print("   - Use Stratified K-Fold based on price bins")
print("   - Ensures balanced price distribution across folds")
print("   - Recommended: 5-fold cross-validation")

print("\n" + "=" * 80)
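
The cross-validation recommendation (stratifying on price bins) can be sketched with pandas alone. Price is continuous, so it is first discretized into quantile bins, and the rows of each bin are then dealt evenly across folds. The data below is synthetic, and `N_BINS = 10` is a hypothetical choice rather than a tuned value.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic stand-in for df['price'] (right-skewed, made-up values).
toy = pd.DataFrame({'price': rng.lognormal(mean=9.5, sigma=0.8, size=1_000)})

N_FOLDS = 5
N_BINS = 10  # hypothetical choice; quantile bins hold roughly equal row counts

# 1. Discretize the continuous target into quantile bins.
toy['price_bin'] = pd.qcut(toy['price'], q=N_BINS, labels=False, duplicates='drop')

# 2. Shuffle, then deal the rows of each bin round-robin into folds, so every
#    fold receives a near-equal share of every price bin.
shuffled = toy.sample(frac=1.0, random_state=42)
toy['fold'] = shuffled.groupby('price_bin').cumcount() % N_FOLDS

# Rows per (price_bin, fold) cell; a balanced split has near-identical counts.
counts = pd.crosstab(toy['price_bin'], toy['fold'])
print(counts)
print(toy.groupby('fold')['price'].median().round(0))
```

If scikit-learn is available, passing the same `price_bin` labels as `y` to `StratifiedKFold.split` achieves a comparable split.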

## Summary

**EDA Completed:**
- Comprehensive univariate analysis of all features
- Bivariate analysis revealing feature-target relationships
- Multivariate analysis showing feature interactions
- Geographic analysis identifying regional price patterns
- Temporal analysis revealing seasonal trends
- Key insights documented for modeling decisions

**Key Findings:**
- Year shows strongest positive correlation with price
- Odometer shows negative correlation (as expected)
- Manufacturer, condition, and type are important categorical predictors
- Geographic location influences pricing
- Some seasonal variation in pricing patterns

**Next Phase:**
- Proceed to Phase 3: Feature Engineering
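
As a preview of Phase 3, the `vehicle_age` and `odometer_per_year` features named in the recommendations might be derived as follows. This is a minimal sketch on made-up rows. It assumes age is measured against each listing's posting year and that ages below one year are clipped to 1; both are illustrative choices.

```python
import pandas as pd

# Toy rows (made-up values) with the columns the recommendations refer to.
toy = pd.DataFrame({
    'year': [2021, 2015, 2008],
    'odometer': [5_000, 60_000, 150_000],
    'posting_date': pd.to_datetime(['2021-05-01', '2021-04-20', '2021-05-03']),
})

# Age relative to the year the listing was posted.
toy['vehicle_age'] = toy['posting_date'].dt.year - toy['year']

# Clip age to at least 1 so a current-model-year car does not divide by zero.
toy['odometer_per_year'] = toy['odometer'] / toy['vehicle_age'].clip(lower=1)

print(toy[['year', 'vehicle_age', 'odometer_per_year']])
```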