# Task 2: Data Visualization & Analytics

This section explores key dimensions of two job postings datasets (a Monster.com sample and a 2025 data science job postings set) through four Vega-Altair visualizations.  
Each visualization highlights a different analytical aspect (**temporal**, **categorical**, **hierarchical**, and **spatial**) to uncover insights about hiring trends, skill demand, role distribution, and geographic salary variation.

In [None]:
import re
import pandas as pd
import altair as alt
from ast import literal_eval
from vega_datasets import data as vega_data

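# Lift Altair's default 5,000-row limit so larger DataFrames can be charted
alt.data_transformers.disable_max_rows()
# Render charts as static SVG output
alt.renderers.enable('svg')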

## Visualization 1 — Monthly Job Posting Trends

**Goal:**  
Analyze temporal hiring patterns by aggregating job postings by month.

**Insights:**  
- Job postings show notable seasonality, peaking around **March** and **September**, indicating recruitment surges during these periods.  
- Dips in **February**, **August**, and **November** suggest potential slowdowns in hiring cycles.

In [2]:
# Load dataset
df = pd.read_csv("../data/monster_com-job_sample.csv", low_memory=False)
df["post_date"] = pd.to_datetime(df["date_added"], errors="coerce")

# Extract month name (ordered)
df["month"] = df["post_date"].dt.month_name()
month_order = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

# Aggregate postings per month
agg = df.groupby("month", as_index=False).size().rename(columns={"size": "num_postings"})
agg["month"] = pd.Categorical(agg["month"], categories=month_order, ordered=True)
agg = agg.sort_values("month")

# Plot: Month vs. Number of Job Postings
chart = (
    alt.Chart(agg, title="Total Number of Job Postings per Month")
      .mark_line(point=True)
      .encode(
          x=alt.X("month:N", title="Month", sort=month_order, axis=alt.Axis(labelAngle=-45)),
          y=alt.Y("num_postings:Q", title="Number of Job Postings"),
          tooltip=["month:N", "num_postings:Q"]
      )
      .properties(width=800, height=400)
    #   .save("../visualizations/v1_monthly_job_posting_trends.png")
)
chart
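
The aggregation above groups on month names, so a month with no postings would be absent from `agg` and the line would simply bridge the gap. A minimal sketch, assuming made-up dates and a hypothetical helper `monthly_counts` (neither is part of the notebook), of one way to keep all twelve months present with explicit zeros:

```python
import pandas as pd

MONTH_ORDER = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]


def monthly_counts(dates: pd.Series) -> pd.DataFrame:
    """Count rows per calendar month, keeping months that have zero rows."""
    # Unparseable values become NaT and are ignored by value_counts()
    months = pd.to_datetime(dates, errors="coerce").dt.month_name()
    counts = months.value_counts().reindex(MONTH_ORDER, fill_value=0)
    return counts.rename_axis("month").reset_index(name="num_postings")


# Toy input (illustrative only): two March dates, one September date, one bad value
toy_dates = pd.Series(["2016-03-01", "2016-03-15", "2017-09-02", "not a date"])
result = monthly_counts(toy_dates)
```

Note that this still pools the same month across different years, exactly as the notebook cell does; it only changes how empty months are represented.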



## Visualization 2 — Top 10 Most In-Demand Skills

**Goal:**  
Identify the most frequently requested technical skills across job postings.

**Insights:**  
- **Python** and **Machine Learning** lead the rankings, followed by **SQL** and **R**, emphasizing the importance of programming and analytics tools.  
- Cloud and big-data skills (**AWS**, **Spark**, **Azure**) remain valuable but less dominant.

In [3]:
# Load
df = pd.read_csv("../data/data_science_job_posts_2025_clean.csv", low_memory=False)

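# Explode → normalize → count top 10
# Non-string cells (e.g. NaN) become empty lists so .strip() is never called on a float
skills = (
    df.assign(skills=df["skills"].apply(
        lambda s: [x.strip(" '\"") for x in s.strip("[]").split(",")] if isinstance(s, str) else []))
      .explode("skills")
)

counts = (
    skills["skills"].str.lower().str.strip()
    .replace("", pd.NA)  # blanks left by empty lists such as "[]" are not real skills
    .value_counts()      # missing values are dropped by default
    .rename_axis("skill").reset_index(name="num_postings")
    .head(10)
)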

# Chart

chart = (
    alt.Chart(counts, title="Skill Demand Ranking")
    .mark_bar()
    .encode(
        x=alt.X("skill:N", title="Technical Skill", sort="-y", axis=alt.Axis(labelAngle=-45)),
        y=alt.Y("num_postings:Q", title="Count in Job Postings"),
        tooltip=["skill:N", "num_postings:Q"]
    )
    .properties(width=800, height=400)
    # .save("../visualizations/v2_top10_demand_skills.png")
)

chart
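
The notebook imports `literal_eval` but splits the `skills` strings by hand. As a sketch, assuming the column holds stringified Python lists (the sample rows below are made up), a hypothetical `parse_skills` helper could try `literal_eval` first and fall back to comma splitting when the string is not a valid literal:

```python
from ast import literal_eval
from collections import Counter


def parse_skills(raw):
    """Turn one raw cell value into a list of clean, lower-cased skill names."""
    if not isinstance(raw, str):
        return []  # NaN, None, or any other non-string cell
    try:
        parsed = literal_eval(raw)
    except (ValueError, SyntaxError):
        # Not a valid Python literal: fall back to manual comma splitting
        parsed = raw.strip("[]").split(",")
    if not isinstance(parsed, (list, tuple, set)):
        parsed = [parsed]  # a single scalar rather than a collection
    cleaned = (str(item).strip(" '\"").lower() for item in parsed)
    return [skill for skill in cleaned if skill]


# Illustrative rows only; the real column contents may differ
rows = ["['Python', 'SQL']", "[python, 'Machine Learning']", float("nan"), "[]"]
skill_counts = Counter(skill for row in rows for skill in parse_skills(row))
top = skill_counts.most_common(1)
```

In the notebook this helper would slot in as `df["skills"].apply(parse_skills)` before `.explode("skills")`.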


## Visualization 3 — Skill Demand by Industry and Role Level

**Goal:**  
Compare how skill demand varies across industries and seniority levels.

**Insights:**  
- The **Technology** and **Retail** industries show sharp increases in skill mentions at senior levels, suggesting that senior roles call for more specialized skills.  
- Other industries such as **Finance** and **Healthcare** maintain steadier skill expectations across roles.  

In [4]:
# Count mentions per (industry, role)
counts = (skills.groupby(["industry", "seniority_level"]).size()
          .reset_index(name="num_skills"))
top5 = counts.groupby("industry")["num_skills"].sum().nlargest(5).index
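counts = counts[counts["industry"].isin(top5)].copy()  # copy to avoid SettingWithCopyWarning below
# No explicit categories are passed, so the levels are ordered alphabetically
counts["seniority_level"] = pd.Categorical(counts["seniority_level"], ordered=True)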

# Chart
highlight = alt.selection_point(fields=["industry"], bind="legend")

chart = (
    alt.Chart(counts, title="Skill Demand by Industry & Role Level")
    .mark_line(point=True, strokeWidth=2)
    .encode(
        x=alt.X("seniority_level:N", title="Role level"),
        y=alt.Y("num_skills:Q", title="Count of skills in job postings"),
        color=alt.Color("industry:N", title="Industry"),
        opacity=alt.condition(highlight, alt.value(1), alt.value(0.15)),
        tooltip=["industry:N", "seniority_level:N", "num_skills:Q"]
    )
    .add_params(highlight)
    .properties(width=800, height=400)
    # .save("../visualizations/v3_skill_demand_by_industry_and_role.png")
)
chart
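
Because the x-axis is a nominal field, the role levels appear in alphabetical order unless an order is supplied. A small sketch, assuming a hypothetical list of level names (the dataset's actual labels are not shown in this section), of how an explicit order could be declared:

```python
import pandas as pd

# Hypothetical level names, listed from least to most senior
LEVEL_ORDER = ["junior", "midlevel", "senior", "lead"]

# Toy counts, illustrative only
toy = pd.DataFrame({
    "seniority_level": ["senior", "junior", "lead", "midlevel"],
    "num_skills": [30, 10, 40, 20],
})

# Passing `categories` fixes the order; `ordered=True` alone would sort alphabetically
toy["seniority_level"] = pd.Categorical(
    toy["seniority_level"], categories=LEVEL_ORDER, ordered=True
)
ordered_levels = toy.sort_values("seniority_level")["seniority_level"].astype(str).tolist()
```

The same list can be passed to the chart as `alt.X("seniority_level:N", sort=LEVEL_ORDER)`, mirroring how `month_order` is used in Visualization 1.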


## Visualization 4 — Median Salary by U.S. State

**Goal:**  
Visualize spatial salary distribution across the United States.

**Insights:**  
- Higher median salaries are concentrated in **Western** and **Northeastern** states.  
- Central and Southeastern states generally show lower median salaries, indicating geographic disparities in compensation.

In [5]:
state_med = (
    df.groupby(["state", "fips_int"], as_index=False)
      .agg(median_salary=("salary_mid", "median"),
           n=("salary_mid", "count"))
      .dropna(subset=["fips_int"])
)

# Choropleth map
(
    alt.Chart(alt.topo_feature(vega_data.us_10m.url, "states"))
      .mark_geoshape(stroke="white", strokeWidth=0.5)
      .transform_lookup(
          lookup="id",
          from_=alt.LookupData(state_med, key="fips_int", fields=["state", "median_salary", "n"])
      )
      .encode(
          color=alt.Color("median_salary:Q", title="Median Salary (USD)",
                          scale=alt.Scale(scheme="blues", domainMin=80000, domainMax=200000)),
          tooltip=[
              "state:N",
              alt.Tooltip("median_salary:Q", title="Median", format="$,.0f"),
              alt.Tooltip("n:Q", title="# postings")
          ]
      )
      .project(type="albersUsa")
      .properties(width=800, height=400, title="Median Salary by State")
    #   .save("../visualizations/v4_median_salary_by_state.png")
)
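
The named aggregation above yields one row per state with its median salary and posting count. A sketch on made-up rows shows the shape of `state_med` and how the `n` column could be used to set aside states with very few postings; the `MIN_POSTINGS` threshold is an arbitrary assumption, not something the notebook applies:

```python
import pandas as pd

# Toy postings, illustrative only; the last row has no state or FIPS code
toy = pd.DataFrame({
    "state": ["CA", "CA", "NY", None],
    "fips_int": [6, 6, 36, None],
    "salary_mid": [150_000, 170_000, 140_000, 100_000],
})

# groupby drops rows with a missing key by default, so the last row is excluded
state_med = (
    toy.groupby(["state", "fips_int"], as_index=False)
       .agg(median_salary=("salary_mid", "median"),
            n=("salary_mid", "count"))
)

MIN_POSTINGS = 2  # hypothetical cut-off for a "reliable" median
reliable = state_med[state_med["n"] >= MIN_POSTINGS]
```

A median based on one or two postings can dominate the color scale, so checking `n` in the tooltip (or filtering as above) helps when reading the map.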