# 02 - Exploratory Data Analysis (EDA)

**Team3 - Singapore Jobs Analytics**

This notebook explores the cleaned dataset to discover key patterns and anomalies that inform dashboard design.

**Areas of Exploration:**
- Data overview and summary statistics
- Distribution of job postings by salary, experience, categories
- Salary deep dive across categories and experience bands
- Temporal trends in job postings
- Company analysis and long-tail distribution
- Engagement patterns (views, applications, application rate)

**Tools:** DuckDB (in-memory OLAP), Pandas, Matplotlib, Seaborn

---
## 1. Setup & Data Loading

In [None]:
import duckdb
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')
sns.set_theme(style='whitegrid', palette='muted')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['figure.dpi'] = 100

con = duckdb.connect(':memory:')
print('DuckDB version:', duckdb.__version__)

In [None]:
# Load data: try cleaned parquet first, fall back to raw CSV with inline cleaning
import os

if os.path.exists('../data/processed/jobs_cleaned.parquet'):
    con.execute("""
        CREATE TABLE jobs_raw AS
        SELECT * FROM read_parquet('../data/processed/jobs_cleaned.parquet')
    """)
    print('Loaded from cleaned parquet (output of 01_data_cleaning)')
else:
    print('jobs_cleaned.parquet not found — loading from raw CSV with inline cleaning...')
    con.execute("""
        CREATE TABLE jobs_raw AS
        SELECT * FROM read_csv_auto('../data/raw/SGJobData.csv', header=true, sample_size=-1)
        WHERE metadata_jobPostId IS NOT NULL
          AND NOT (salary_minimum = 1 AND salary_maximum = 1 AND average_salary = 1)
    """)
    print('Loaded from raw CSV with cleaning applied (removed NULL IDs + placeholder salaries)')

row_count = con.execute('SELECT COUNT(*) FROM jobs_raw').fetchone()[0]
print(f'Total rows: {row_count:,}')

In [None]:
# Create analytical views for EDA
# jobs_base: normalize salary, cast dates, compute application rate
con.execute("""
CREATE OR REPLACE VIEW jobs_base AS
SELECT
  metadata_jobPostId as job_id,
  title,
  postedCompany_name as company_name,
  salary_minimum,
  salary_maximum,
  salary_type,
  CASE
    WHEN salary_minimum IS NOT NULL AND salary_maximum IS NOT NULL
    THEN (salary_minimum + salary_maximum) / 2
    ELSE average_salary
  END as avg_salary,
  salary_maximum - salary_minimum as salary_range,
  minimumYearsExperience as min_experience,
  numberOfVacancies as vacancies,
  status_jobStatus as job_status,
  TRY_CAST(metadata_originalPostingDate AS DATE) as posting_date,
  TRY_CAST(metadata_expiryDate AS DATE) as expiry_date,
  metadata_totalNumberJobApplication as applications,
  metadata_totalNumberOfView as views,
  CASE
    WHEN metadata_totalNumberOfView > 0
    THEN CAST(metadata_totalNumberJobApplication AS FLOAT) / metadata_totalNumberOfView
    ELSE NULL
  END as application_rate,
  metadata_repostCount as repost_count,
  categories
FROM jobs_raw
WHERE metadata_jobPostId IS NOT NULL
""")

# jobs_enriched: add salary bands, experience bands, time dimensions
con.execute("""
CREATE OR REPLACE VIEW jobs_enriched AS
SELECT
  *,
  CASE
    WHEN avg_salary < 3000 THEN '< 3K'
    WHEN avg_salary < 5000 THEN '3K - 5K'
    WHEN avg_salary < 8000 THEN '5K - 8K'
    WHEN avg_salary < 12000 THEN '8K - 12K'
    WHEN avg_salary < 20000 THEN '12K - 20K'
    ELSE '20K+'
  END as salary_band,
  CASE
    WHEN min_experience IS NULL OR min_experience <= 2 THEN 'Entry (0-2 years)'
    WHEN min_experience <= 5 THEN 'Mid (3-5 years)'
    WHEN min_experience <= 10 THEN 'Senior (6-10 years)'
    ELSE 'Executive (10+ years)'
  END as experience_band,
  EXTRACT(YEAR FROM posting_date) as posting_year,
  EXTRACT(MONTH FROM posting_date) as posting_month,
  EXTRACT(QUARTER FROM posting_date) as posting_quarter,
  EXTRACT(DOW FROM posting_date) as posting_day_of_week,
  expiry_date - posting_date as days_active
FROM jobs_base
WHERE avg_salary IS NOT NULL AND avg_salary > 0
""")

# jobs_categories: flatten JSON categories array
con.execute("""
CREATE OR REPLACE VIEW jobs_categories AS
WITH indices AS (
  SELECT 0 as idx UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
  UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
)
SELECT
  j.job_id, j.title, j.company_name,
  CAST(json_extract_string(json_extract(j.categories::JSON, '$[' || i.idx || ']'), '$.id') AS INTEGER) as category_id,
  json_extract_string(json_extract(j.categories::JSON, '$[' || i.idx || ']'), '$.category') as category_name,
  j.salary_minimum, j.salary_maximum, j.avg_salary, j.posting_date,
  j.job_status, j.min_experience, j.vacancies, j.experience_band, j.salary_band
FROM jobs_enriched j
CROSS JOIN indices i
WHERE j.categories IS NOT NULL AND j.categories != '' AND j.categories != '[]'
  AND i.idx < json_array_length(j.categories::JSON)
  AND json_extract(j.categories::JSON, '$[' || i.idx || ']') IS NOT NULL
""")

# Verify view counts
for view in ['jobs_base', 'jobs_enriched', 'jobs_categories']:
    cnt = con.execute(f'SELECT COUNT(*) FROM {view}').fetchone()[0]
    print(f'{view}: {cnt:,} rows')

---
## 2. Data Overview

In [None]:
# Schema and data types
schema = con.execute('DESCRIBE jobs_raw').fetchdf()
schema

In [None]:
# Preview first 5 rows
preview = con.execute('SELECT * FROM jobs_raw LIMIT 5').fetchdf()
preview

In [None]:
# Summary statistics of numeric columns
numeric_stats = con.execute("""
    SELECT
        COUNT(*) as total_rows,
        AVG(salary_minimum) as avg_salary_min,
        AVG(salary_maximum) as avg_salary_max,
        AVG(average_salary) as avg_salary,
        AVG(minimumYearsExperience) as avg_experience,
        AVG(numberOfVacancies) as avg_vacancies,
        AVG(metadata_totalNumberJobApplication) as avg_applications,
        AVG(metadata_totalNumberOfView) as avg_views
    FROM jobs_raw
""").fetchdf()
numeric_stats.T.rename(columns={0: 'Value'})

In [None]:
# Detailed summary statistics
detailed_stats = con.execute("""
    SELECT
        'salary_minimum' as field,
        MIN(salary_minimum) as min_val,
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY salary_minimum) as p25,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary_minimum) as median,
        AVG(salary_minimum) as mean,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY salary_minimum) as p75,
        MAX(salary_minimum) as max_val,
        COUNT(salary_minimum) as non_null
    FROM jobs_raw
    UNION ALL
    SELECT 'salary_maximum',
        MIN(salary_maximum), PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY salary_maximum),
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary_maximum),
        AVG(salary_maximum), PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY salary_maximum),
        MAX(salary_maximum), COUNT(salary_maximum)
    FROM jobs_raw
    UNION ALL
    SELECT 'average_salary',
        MIN(average_salary), PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY average_salary),
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY average_salary),
        AVG(average_salary), PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY average_salary),
        MAX(average_salary), COUNT(average_salary)
    FROM jobs_raw
    UNION ALL
    SELECT 'minimumYearsExperience',
        MIN(minimumYearsExperience), PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY minimumYearsExperience),
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY minimumYearsExperience),
        AVG(minimumYearsExperience), PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY minimumYearsExperience),
        MAX(minimumYearsExperience), COUNT(minimumYearsExperience)
    FROM jobs_raw
    UNION ALL
    SELECT 'numberOfVacancies',
        MIN(numberOfVacancies), PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY numberOfVacancies),
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY numberOfVacancies),
        AVG(numberOfVacancies), PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY numberOfVacancies),
        MAX(numberOfVacancies), COUNT(numberOfVacancies)
    FROM jobs_raw
""").fetchdf()
detailed_stats

---
## 3. Univariate Analysis

In [None]:
# Salary distribution: histogram + box plot (cap at $50K to handle outliers)
salary_data = con.execute("""
    SELECT avg_salary FROM jobs_enriched
    WHERE avg_salary > 0 AND avg_salary < 50000
""").fetchdf()

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].hist(salary_data['avg_salary'], bins=50, color='#2563EB', edgecolor='white', alpha=0.8)
axes[0].set_xlabel('Average Salary ($)')
axes[0].set_ylabel('Count')
axes[0].set_title('Salary Distribution (capped at $50K)')
axes[0].axvline(salary_data['avg_salary'].median(), color='#EF4444', linestyle='--', label=f"Median: ${salary_data['avg_salary'].median():,.0f}")
axes[0].legend()

axes[1].boxplot(salary_data['avg_salary'], vert=True, patch_artist=True,
                boxprops=dict(facecolor='#2563EB', alpha=0.6),
                medianprops=dict(color='#EF4444', linewidth=2))
axes[1].set_ylabel('Average Salary ($)')
axes[1].set_title('Salary Box Plot')

plt.tight_layout()
plt.show()

print(f"Salary stats (capped at $50K):")
print(f"  Mean:   ${salary_data['avg_salary'].mean():,.0f}")
print(f"  Median: ${salary_data['avg_salary'].median():,.0f}")
print(f"  Std:    ${salary_data['avg_salary'].std():,.0f}")

In [None]:
# Salary band distribution
SALARY_BAND_ORDER = ['< 3K', '3K - 5K', '5K - 8K', '8K - 12K', '12K - 20K', '20K+']

salary_bands = con.execute("""
    SELECT salary_band, COUNT(*) as count
    FROM jobs_enriched
    GROUP BY salary_band
""").fetchdf()

salary_bands['salary_band'] = pd.Categorical(salary_bands['salary_band'], categories=SALARY_BAND_ORDER, ordered=True)
salary_bands = salary_bands.sort_values('salary_band')

fig, ax = plt.subplots(figsize=(10, 5))
bars = ax.bar(salary_bands['salary_band'].astype(str), salary_bands['count'],
              color=['#2563EB', '#10B981', '#F59E0B', '#8B5CF6', '#EC4899', '#06B6D4'])
ax.set_xlabel('Salary Band')
ax.set_ylabel('Number of Jobs')
ax.set_title('Job Distribution by Salary Band')
for bar in bars:
    ax.text(bar.get_x() + bar.get_width()/2., bar.get_height(),
            f'{int(bar.get_height()):,}', ha='center', va='bottom', fontsize=9)
plt.tight_layout()
plt.show()

In [None]:
# Experience band distribution + raw experience histogram
EXP_BAND_ORDER = ['Entry (0-2 years)', 'Mid (3-5 years)', 'Senior (6-10 years)', 'Executive (10+ years)']

exp_bands = con.execute("""
    SELECT experience_band, COUNT(*) as count
    FROM jobs_enriched
    GROUP BY experience_band
""").fetchdf()
exp_bands['experience_band'] = pd.Categorical(exp_bands['experience_band'], categories=EXP_BAND_ORDER, ordered=True)
exp_bands = exp_bands.sort_values('experience_band')

exp_raw = con.execute("""
    SELECT min_experience FROM jobs_enriched
    WHERE min_experience IS NOT NULL AND min_experience <= 30
""").fetchdf()

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].bar(exp_bands['experience_band'].astype(str), exp_bands['count'],
            color=['#10B981', '#2563EB', '#F59E0B', '#8B5CF6'])
axes[0].set_xlabel('Experience Band')
axes[0].set_ylabel('Number of Jobs')
axes[0].set_title('Job Distribution by Experience Band')
axes[0].tick_params(axis='x', rotation=15)

axes[1].hist(exp_raw['min_experience'], bins=30, color='#2563EB', edgecolor='white', alpha=0.8)
axes[1].set_xlabel('Minimum Years of Experience')
axes[1].set_ylabel('Count')
axes[1].set_title('Raw Experience Distribution')

plt.tight_layout()
plt.show()

In [None]:
# Vacancies distribution
vacancies = con.execute("""
    SELECT vacancies FROM jobs_enriched
    WHERE vacancies IS NOT NULL AND vacancies > 0 AND vacancies <= 50
""").fetchdf()

fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(vacancies['vacancies'], bins=50, color='#10B981', edgecolor='white', alpha=0.8)
ax.set_xlabel('Number of Vacancies')
ax.set_ylabel('Count')
ax.set_title('Vacancies per Job Posting (capped at 50)')
ax.axvline(vacancies['vacancies'].median(), color='#EF4444', linestyle='--',
           label=f"Median: {vacancies['vacancies'].median():.0f}")
ax.legend()
plt.tight_layout()
plt.show()

In [None]:
# Job status distribution
status_dist = con.execute("""
    SELECT job_status, COUNT(*) as count
    FROM jobs_base
    WHERE job_status IS NOT NULL
    GROUP BY job_status
    ORDER BY count DESC
""").fetchdf()

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].barh(status_dist['job_status'], status_dist['count'], color='#2563EB')
axes[0].set_xlabel('Count')
axes[0].set_title('Job Status Distribution')

axes[1].pie(status_dist['count'], labels=status_dist['job_status'], autopct='%1.1f%%',
            colors=['#2563EB', '#10B981', '#F59E0B', '#8B5CF6', '#EC4899'][:len(status_dist)])
axes[1].set_title('Job Status Proportions')

plt.tight_layout()
plt.show()

---
## 4. Category Analysis

In [None]:
# Sample JSON structure preview
sample_cats = con.execute("""
    SELECT categories
    FROM jobs_raw
    WHERE categories IS NOT NULL AND categories != '' AND categories != '[]'
    LIMIT 3
""").fetchdf()
print('Sample categories JSON:')
for i, row in sample_cats.iterrows():
    print(f"  Row {i}: {row['categories'][:200]}")

In [None]:
# Top 20 categories by posting volume
top_categories = con.execute("""
    SELECT category_name, COUNT(*) as job_count
    FROM jobs_categories
    WHERE category_name IS NOT NULL
    GROUP BY category_name
    ORDER BY job_count DESC
    LIMIT 20
""").fetchdf()

fig, ax = plt.subplots(figsize=(10, 8))
ax.barh(top_categories['category_name'][::-1], top_categories['job_count'][::-1], color='#2563EB')
ax.set_xlabel('Number of Job Postings')
ax.set_title('Top 20 Job Categories by Posting Volume')
plt.tight_layout()
plt.show()

In [None]:
# Categories per job distribution
cats_per_job = con.execute("""
    SELECT job_id, COUNT(*) as num_categories
    FROM jobs_categories
    GROUP BY job_id
""").fetchdf()

fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(cats_per_job['num_categories'], bins=range(1, cats_per_job['num_categories'].max() + 2),
        color='#F59E0B', edgecolor='white', alpha=0.8, align='left')
ax.set_xlabel('Number of Categories per Job')
ax.set_ylabel('Count')
ax.set_title('Categories per Job Posting')
ax.set_xticks(range(1, min(11, cats_per_job['num_categories'].max() + 1)))
plt.tight_layout()
plt.show()

print(f"Mean categories per job: {cats_per_job['num_categories'].mean():.2f}")
print(f"Max categories per job: {cats_per_job['num_categories'].max()}")

---
## 5. Salary Deep Dive

In [None]:
# Salary by top 10 categories (box plots)
top10_cats = con.execute("""
    SELECT category_name FROM jobs_categories
    WHERE category_name IS NOT NULL
    GROUP BY category_name
    ORDER BY COUNT(*) DESC
    LIMIT 10
""").fetchdf()['category_name'].tolist()

cat_salary_data = con.execute(f"""
    SELECT category_name, avg_salary
    FROM jobs_categories
    WHERE category_name IN ({', '.join(f"'{c}'" for c in top10_cats)})
      AND avg_salary > 0 AND avg_salary < 50000
""").fetchdf()

fig, ax = plt.subplots(figsize=(14, 6))
cat_order = cat_salary_data.groupby('category_name')['avg_salary'].median().sort_values(ascending=False).index
sns.boxplot(data=cat_salary_data, x='category_name', y='avg_salary', order=cat_order, ax=ax, palette='muted')
ax.set_xlabel('Category')
ax.set_ylabel('Average Salary ($)')
ax.set_title('Salary Distribution by Top 10 Categories')
ax.tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Mean vs Median salary by top 15 categories
mean_median = con.execute("""
    SELECT
        category_name,
        AVG(avg_salary) as mean_salary,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY avg_salary) as median_salary,
        COUNT(*) as job_count
    FROM jobs_categories
    WHERE category_name IS NOT NULL AND avg_salary > 0 AND avg_salary < 50000
    GROUP BY category_name
    ORDER BY mean_salary DESC
    LIMIT 15
""").fetchdf()

fig, ax = plt.subplots(figsize=(14, 6))
x = range(len(mean_median))
width = 0.35
ax.bar([i - width/2 for i in x], mean_median['mean_salary'], width, label='Mean', color='#2563EB')
ax.bar([i + width/2 for i in x], mean_median['median_salary'], width, label='Median', color='#10B981')
ax.set_xticks(x)
ax.set_xticklabels(mean_median['category_name'], rotation=45, ha='right')
ax.set_ylabel('Salary ($)')
ax.set_title('Mean vs Median Salary by Top 15 Categories')
ax.legend()
plt.tight_layout()
plt.show()

In [None]:
# Salary by experience band
salary_by_exp = con.execute("""
    SELECT
        experience_band,
        AVG(avg_salary) as mean_salary,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY avg_salary) as median_salary,
        COUNT(*) as job_count
    FROM jobs_enriched
    WHERE avg_salary > 0 AND avg_salary < 50000
    GROUP BY experience_band
    ORDER BY
        CASE experience_band
            WHEN 'Entry (0-2 years)' THEN 1
            WHEN 'Mid (3-5 years)' THEN 2
            WHEN 'Senior (6-10 years)' THEN 3
            WHEN 'Executive (10+ years)' THEN 4
        END
""").fetchdf()

fig, ax = plt.subplots(figsize=(10, 5))
x = range(len(salary_by_exp))
width = 0.35
ax.bar([i - width/2 for i in x], salary_by_exp['mean_salary'], width, label='Mean', color='#2563EB')
ax.bar([i + width/2 for i in x], salary_by_exp['median_salary'], width, label='Median', color='#10B981')
ax.set_xticks(x)
ax.set_xticklabels(salary_by_exp['experience_band'])
ax.set_ylabel('Salary ($)')
ax.set_title('Salary by Experience Band')
ax.legend()
plt.tight_layout()
plt.show()

In [None]:
# Salary band vs experience band heatmap
SALARY_BAND_ORDER = ['< 3K', '3K - 5K', '5K - 8K', '8K - 12K', '12K - 20K', '20K+']
EXP_BAND_ORDER = ['Entry (0-2 years)', 'Mid (3-5 years)', 'Senior (6-10 years)', 'Executive (10+ years)']

heatmap_data = con.execute("""
    SELECT salary_band, experience_band, COUNT(*) as job_count
    FROM jobs_enriched
    GROUP BY salary_band, experience_band
""").fetchdf()

pivot = heatmap_data.pivot_table(index='salary_band', columns='experience_band', values='job_count', fill_value=0)
pivot = pivot.reindex(index=SALARY_BAND_ORDER, columns=EXP_BAND_ORDER)

fig, ax = plt.subplots(figsize=(10, 6))
sns.heatmap(pivot, annot=True, fmt=',', cmap='YlOrRd', ax=ax)
ax.set_title('Job Count: Salary Band vs Experience Band')
ax.set_ylabel('Salary Band')
ax.set_xlabel('Experience Band')
plt.tight_layout()
plt.show()

### Salary Insights

- **Right-skewed distribution:** The salary distribution is heavily right-skewed, with a long tail of high earners. The $50K cap filters extreme outliers while preserving 99%+ of data.
- **Mean > Median gap:** Across most categories, the mean salary exceeds the median, confirming right skew. The dashboard presents both metrics to give users a realistic picture.
- **Clear experience-salary progression:** Salary increases monotonically with experience band, validating the 4-band classification used in the dashboard.
- **Entry-level concentration:** The heatmap reveals that the largest job volume sits in the Entry/Mid experience bands at the 3K-8K salary range.

---
## 6. Temporal Analysis

In [None]:
# Date range of dataset
date_range = con.execute("""
    SELECT
        MIN(posting_date) as earliest,
        MAX(posting_date) as latest,
        COUNT(DISTINCT posting_date) as unique_dates,
        DATEDIFF('day', MIN(posting_date), MAX(posting_date)) as span_days
    FROM jobs_enriched
    WHERE posting_date IS NOT NULL
""").fetchdf()
print('Date range of dataset:')
date_range

In [None]:
# Monthly posting volume + avg salary (dual-axis line)
monthly = con.execute("""
    SELECT
        DATE_TRUNC('month', posting_date) as month,
        COUNT(*) as job_count,
        AVG(avg_salary) as avg_salary
    FROM jobs_enriched
    WHERE posting_date IS NOT NULL
    GROUP BY DATE_TRUNC('month', posting_date)
    ORDER BY month
""").fetchdf()
monthly['month'] = pd.to_datetime(monthly['month'])

fig, ax1 = plt.subplots(figsize=(14, 6))
ax1.bar(monthly['month'], monthly['job_count'], width=20, alpha=0.6, color='#2563EB', label='Job Postings')
ax1.set_xlabel('Month')
ax1.set_ylabel('Number of Postings', color='#2563EB')
ax1.tick_params(axis='y', labelcolor='#2563EB')

ax2 = ax1.twinx()
ax2.plot(monthly['month'], monthly['avg_salary'], color='#EF4444', linewidth=2, marker='o', markersize=4, label='Avg Salary')
ax2.set_ylabel('Average Salary ($)', color='#EF4444')
ax2.tick_params(axis='y', labelcolor='#EF4444')

lines1, labels1 = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax1.legend(lines1 + lines2, labels1 + labels2, loc='upper left')
ax1.set_title('Monthly Posting Volume & Average Salary')
plt.tight_layout()
plt.show()

In [None]:
# Day of week posting pattern
dow = con.execute("""
    SELECT
        posting_day_of_week as dow,
        CASE posting_day_of_week
            WHEN 0 THEN 'Sunday'
            WHEN 1 THEN 'Monday'
            WHEN 2 THEN 'Tuesday'
            WHEN 3 THEN 'Wednesday'
            WHEN 4 THEN 'Thursday'
            WHEN 5 THEN 'Friday'
            WHEN 6 THEN 'Saturday'
        END as day_name,
        COUNT(*) as job_count
    FROM jobs_enriched
    WHERE posting_day_of_week IS NOT NULL
    GROUP BY posting_day_of_week
    ORDER BY posting_day_of_week
""").fetchdf()

fig, ax = plt.subplots(figsize=(10, 5))
colors = ['#F59E0B' if d in [0, 6] else '#2563EB' for d in dow['dow']]
ax.bar(dow['day_name'], dow['job_count'], color=colors)
ax.set_xlabel('Day of Week')
ax.set_ylabel('Number of Postings')
ax.set_title('Posting Volume by Day of Week (weekends highlighted in amber)')
plt.tight_layout()
plt.show()

In [None]:
# Quarterly posting volume
quarterly = con.execute("""
    SELECT
        posting_year || '-Q' || posting_quarter as quarter,
        COUNT(*) as job_count
    FROM jobs_enriched
    WHERE posting_year IS NOT NULL AND posting_quarter IS NOT NULL
    GROUP BY posting_year, posting_quarter
    ORDER BY posting_year, posting_quarter
""").fetchdf()

fig, ax = plt.subplots(figsize=(12, 5))
ax.bar(quarterly['quarter'], quarterly['job_count'], color='#10B981')
ax.set_xlabel('Quarter')
ax.set_ylabel('Number of Postings')
ax.set_title('Quarterly Posting Volume')
ax.tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.show()

### Temporal Insights

- **Weekday dominance:** Job postings are heavily concentrated on weekdays, with significantly lower volumes on weekends.
- **Seasonal patterns:** Quarterly analysis reveals hiring cycles — volume tends to dip in certain quarters, which may align with budget cycles and public holidays.
- **Salary stability:** Average salary remains relatively stable month-to-month despite volume fluctuations, suggesting the salary trends feature in the dashboard should use monthly granularity.
- **Dashboard design implication:** The Opportunity Finder tab uses quarterly trends to help users identify hiring peaks.

---
## 7. Company Analysis

In [None]:
# Top 20 companies by posting volume
top_companies = con.execute("""
    SELECT company_name, COUNT(*) as job_count
    FROM jobs_enriched
    WHERE company_name IS NOT NULL
    GROUP BY company_name
    ORDER BY job_count DESC
    LIMIT 20
""").fetchdf()

fig, ax = plt.subplots(figsize=(10, 8))
ax.barh(top_companies['company_name'][::-1], top_companies['job_count'][::-1], color='#2563EB')
ax.set_xlabel('Number of Job Postings')
ax.set_title('Top 20 Companies by Posting Volume')
plt.tight_layout()
plt.show()

In [None]:
# Company distribution by posting count (long-tail analysis)
company_dist = con.execute("""
    SELECT job_count, COUNT(*) as num_companies
    FROM (
        SELECT company_name, COUNT(*) as job_count
        FROM jobs_enriched
        WHERE company_name IS NOT NULL
        GROUP BY company_name
    )
    GROUP BY job_count
    ORDER BY job_count
""").fetchdf()

fig, ax = plt.subplots(figsize=(12, 5))
ax.scatter(company_dist['job_count'], company_dist['num_companies'], alpha=0.6, color='#8B5CF6', s=20)
ax.set_xlabel('Number of Job Postings per Company')
ax.set_ylabel('Number of Companies')
ax.set_title('Long-Tail Distribution: Companies by Posting Count')
ax.set_xscale('log')
ax.set_yscale('log')
plt.tight_layout()
plt.show()

total_companies = con.execute("SELECT COUNT(DISTINCT company_name) FROM jobs_enriched WHERE company_name IS NOT NULL").fetchone()[0]
single_post = company_dist[company_dist['job_count'] == 1]['num_companies'].iloc[0] if 1 in company_dist['job_count'].values else 0
print(f"Total unique companies: {total_companies:,}")
print(f"Companies with only 1 posting: {single_post:,} ({single_post/total_companies*100:.1f}%)")

In [None]:
# Top 15 highest-paying companies (minimum 5 postings)
high_pay_companies = con.execute("""
    SELECT company_name,
           AVG(avg_salary) as mean_salary,
           PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY avg_salary) as median_salary,
           COUNT(*) as job_count
    FROM jobs_enriched
    WHERE company_name IS NOT NULL AND avg_salary > 0 AND avg_salary < 50000
    GROUP BY company_name
    HAVING COUNT(*) >= 5
    ORDER BY mean_salary DESC
    LIMIT 15
""").fetchdf()

fig, ax = plt.subplots(figsize=(12, 6))
x = range(len(high_pay_companies))
width = 0.35
ax.bar([i - width/2 for i in x], high_pay_companies['mean_salary'], width, label='Mean', color='#2563EB')
ax.bar([i + width/2 for i in x], high_pay_companies['median_salary'], width, label='Median', color='#10B981')
ax.set_xticks(x)
ax.set_xticklabels(high_pay_companies['company_name'], rotation=45, ha='right')
ax.set_ylabel('Salary ($)')
ax.set_title('Top 15 Highest-Paying Companies (min 5 postings)')
ax.legend()
plt.tight_layout()
plt.show()

---
## 8. Engagement Analysis

In [None]:
# Views vs applications scatter (colored by salary)
engagement = con.execute("""
    SELECT views, applications, avg_salary, salary_band
    FROM jobs_enriched
    WHERE views IS NOT NULL AND views > 0
      AND applications IS NOT NULL AND applications >= 0
      AND views < 10000 AND applications < 1000
""").fetchdf()

fig, ax = plt.subplots(figsize=(12, 6))
scatter = ax.scatter(engagement['views'], engagement['applications'],
                     c=engagement['avg_salary'], cmap='viridis',
                     alpha=0.3, s=10)
plt.colorbar(scatter, label='Average Salary ($)')
ax.set_xlabel('Views')
ax.set_ylabel('Applications')
ax.set_title('Views vs Applications (colored by salary)')
plt.tight_layout()
plt.show()

In [None]:
# Application rate distribution
app_rate = con.execute("""
    SELECT application_rate
    FROM jobs_enriched
    WHERE application_rate IS NOT NULL AND application_rate > 0 AND application_rate < 1
""").fetchdf()

fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(app_rate['application_rate'], bins=50, color='#10B981', edgecolor='white', alpha=0.8)
ax.set_xlabel('Application Rate (applications / views)')
ax.set_ylabel('Count')
ax.set_title('Application Rate Distribution')
ax.axvline(app_rate['application_rate'].median(), color='#EF4444', linestyle='--',
           label=f"Median: {app_rate['application_rate'].median():.3f}")
ax.legend()
plt.tight_layout()
plt.show()

In [None]:
# Application rate by category
cat_app_rate = con.execute("""
    SELECT
        jc.category_name,
        AVG(je.application_rate) as avg_app_rate,
        COUNT(*) as job_count
    FROM jobs_categories jc
    JOIN jobs_enriched je ON jc.job_id = je.job_id
    WHERE je.application_rate IS NOT NULL AND je.application_rate > 0 AND je.application_rate < 1
      AND jc.category_name IS NOT NULL
    GROUP BY jc.category_name
    HAVING COUNT(*) >= 100
    ORDER BY avg_app_rate DESC
    LIMIT 15
""").fetchdf()

fig, ax = plt.subplots(figsize=(12, 6))
ax.barh(cat_app_rate['category_name'][::-1], cat_app_rate['avg_app_rate'][::-1], color='#8B5CF6')
ax.set_xlabel('Average Application Rate')
ax.set_title('Application Rate by Category (min 100 postings)')
plt.tight_layout()
plt.show()

### Engagement Insights

- **Positive correlation:** Views and applications show a clear positive relationship, but the scatter reveals significant variance — some high-view jobs get few applications, suggesting role-specific barriers.
- **Low application rates:** Most jobs have application rates well below 10%, indicating that job seekers are selective or that many postings attract passive viewers.
- **Category variation:** Application rates vary significantly by category, revealing which sectors face talent shortages (low rates) vs. high competition (high rates).
- **Dashboard design:** The application rate metric (applications/views) is included in the dashboard as a "competition indicator" to help users identify less competitive opportunities.

---
## 9. Key Findings Summary

### Data Quality
- Dataset contains 1M+ job postings with 22 columns
- Core fields have zero null rates; `occupationId` is 100% null and excluded
- JSON fields (`categories`) require careful parsing; `positionLevels` and `employmentTypes` are plain strings
- Salary data is clean with no invalid ranges (max < min)

### Salary Insights
- Distribution is right-skewed — median is more representative than mean
- The $50K salary cap effectively removes extreme outliers while preserving 99%+ of data
- Clear salary progression across experience bands validates the 4-band classification
- Significant salary variation across categories — some pay 2-3x more than others

### Market Structure
- Long-tail company distribution: a few large employers dominate posting volume
- Top 20 categories account for the majority of all categorized postings
- Most jobs are tagged with 1-3 categories

### Temporal Patterns
- Strong weekday bias in posting activity
- Seasonal hiring patterns visible at quarterly granularity
- Salary levels remain relatively stable over time

### Engagement Patterns
- Most jobs have low application rates (<10%)
- Positive but noisy correlation between views and applications
- Application rates vary significantly by category

### Dashboard Design Decisions Informed by EDA
1. **$50K salary cap** — removes outliers without losing meaningful data
2. **6 salary bands** — match natural clusters in the distribution
3. **4 experience bands** — align with clear salary progression steps
4. **Application rate as competition metric** — provides actionable signal
5. **Monthly granularity for trends** — balances detail with readability
6. **Category-first navigation** — categories have the best coverage
7. **Mean + Median presentation** — both metrics shown to account for skewness
8. **Company comparison mode** — long-tail distribution means top employers are meaningful benchmarks

In [None]:
# Clean up
con.close()
print('EDA complete. All views and connections closed.')