# Chapter 4 -- Data Visualisation
## *Python for AI/ML: A Complete Learning Journey*

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/timothy-watt/python-for-ai-ml/blob/main/CH04_Data_Visualization.ipynb)
&nbsp;&nbsp;[![Back to TOC](https://img.shields.io/badge/Back_to-Table_of_Contents-1B3A5C?style=flat-square)](https://colab.research.google.com/github/timothy-watt/python-for-ai-ml/blob/main/Python_for_AIML_TOC.ipynb)

---

**Part:** 2 -- Data Science Foundations  
**Prerequisites:** Chapter 3 (NumPy and Pandas)  
**Estimated time:** 4-5 hours

---

### Learning Objectives

By the end of this chapter you will be able to:

- Build publication-quality plots with Matplotlib: line, scatter, bar, and histogram
- Customise figures with titles, labels, legends, styles, and subplots
- Use Seaborn for statistical visualisation: distributions, relationships, and categorical plots
- Interpret what box plots, violin plots, and pair plots actually show statistically
- Build correlation heatmaps and annotate them meaningfully
- Create interactive charts with Plotly Express
- Choose the right chart type for a given question

---

### Project Thread -- Chapter 4

Every chart in this chapter uses the SO 2025 dataset. We build a complete
Exploratory Data Analysis (EDA) suite: salary distributions, language popularity,
AI tool adoption, salary comparisons by country, and the relationship between
experience and compensation. By the end you will have a visual picture of the 2025
developer landscape that motivates the modelling decisions in Part 3.


---

## Setup -- Imports, Style, and Data


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Install plotly if not already present (Colab may need this)
try:
    import plotly.express as px
    import plotly.graph_objects as go
except ImportError:
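    import subprocess
    import sys
    # Call pip through the running interpreter so the package lands in this environment
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'plotly', '-q'], check=True)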
    import plotly.express as px
    import plotly.graph_objects as go

import matplotlib
import plotly

print(f'Matplotlib: {matplotlib.__version__}')
print(f'Seaborn:    {sns.__version__}')
print(f'Plotly:     {plotly.__version__}')   # the version lives on the top-level package

# Set a clean, consistent style for all Matplotlib/Seaborn charts
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('deep')
plt.rcParams['figure.dpi']    = 110
plt.rcParams['font.size']     = 11
plt.rcParams['axes.titlesize'] = 13
plt.rcParams['axes.titleweight'] = 'bold'

DATASET_URL = 'https://raw.githubusercontent.com/timothy-watt/python-for-ai-ml/main/data/so_survey_2025_curated.csv'


In [None]:
# Load and clean the SO 2025 dataset -- same pipeline as Chapter 3
df_raw = pd.read_csv(DATASET_URL)

df = df_raw.copy()

# Salary cleaning: coerce to numeric first so unparseable values become NaN, then drop them
df['ConvertedCompYearly'] = pd.to_numeric(df['ConvertedCompYearly'], errors='coerce')
df = df.dropna(subset=['ConvertedCompYearly'])
Q1, Q3 = df['ConvertedCompYearly'].quantile([0.25, 0.75])
IQR = Q3 - Q1
df = df[
    (df['ConvertedCompYearly'] >= max(Q1 - 3*IQR, 5_000)) &
    (df['ConvertedCompYearly'] <= min(Q3 + 3*IQR, 600_000))
].copy()

# Years experience
if 'YearsCodePro' in df.columns:
    df['YearsCodePro'] = pd.to_numeric(df['YearsCodePro'], errors='coerce')
    df['YearsCodePro'] = df['YearsCodePro'].fillna(df['YearsCodePro'].median())

# Fill missing categoricals
for col in ['Country', 'DevType', 'EdLevel', 'RemoteWork', 'OrgSize']:
    if col in df.columns:
        df[col] = df[col].fillna('Unknown')

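# Convenience columns
# Match 'Python' as a whole list entry; a substring test would also match
# names that merely contain it (for example 'MicroPython')
if 'LanguageHaveWorkedWith' in df.columns:
    df['uses_python'] = (
        df['LanguageHaveWorkedWith'].fillna('').str.split(';')
        .apply(lambda langs: 'Python' in [lang.strip() for lang in langs])
    )
else:
    df['uses_python'] = False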
df['salary_band'] = pd.cut(
    df['ConvertedCompYearly'],
    bins=[0, 60_000, 100_000, 150_000, 200_000, float('inf')],
    labels=['Junior', 'Mid-level', 'Senior', 'Senior+', 'Principal'],
    right=False
)

df = df.reset_index(drop=True)
print(f'Dataset ready: {len(df):,} rows x {df.shape[1]} columns')
print(f'Salary: ${df["ConvertedCompYearly"].median():,.0f} median, ${df["ConvertedCompYearly"].mean():,.0f} mean')


---

## Section 4.1 -- Matplotlib: The Foundation

Matplotlib is the base layer of Python visualisation. Seaborn and Pandas plotting
are built directly on top of it; Plotly is a separate library with its own renderer.

The core mental model:
- A **Figure** is the entire canvas -- the outer container
- An **Axes** (ax) is one individual plot within the figure
- Most customisation happens by calling methods on `ax`

The standard pattern you will write hundreds of times:
```python
fig, ax = plt.subplots(figsize=(width, height))
ax.plot(...)        # or ax.scatter, ax.bar, ax.hist ...
ax.set_title(...)
ax.set_xlabel(...)
plt.tight_layout()
plt.show()
```
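
As a concrete instance of the pattern, the sketch below fills in the placeholders with made-up numbers. The experience and salary values are illustrative assumptions, not figures from the survey.

```python
import matplotlib.pyplot as plt

# Illustrative values only -- not taken from the SO 2025 dataset
years = [0, 2, 4, 6, 8, 10]
salary_k = [45, 58, 70, 83, 91, 104]

fig, ax = plt.subplots(figsize=(6, 4))      # Figure = canvas, Axes = one plot on it
ax.plot(years, salary_k, marker='o')        # draw on the Axes object
ax.set_title('Salary vs Experience (toy data)')
ax.set_xlabel('Years of experience')
ax.set_ylabel('Salary (USD thousands)')
plt.tight_layout()
plt.show()
```

Swapping `ax.plot` for `ax.scatter`, `ax.bar`, or `ax.hist` changes the chart type while every other line stays the same.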


In [None]:
# 4.1.1 -- Histogram: salary distribution

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

sal = df['ConvertedCompYearly']

# Left: linear scale
axes[0].hist(sal, bins=60, color='#2E75B6', edgecolor='white', linewidth=0.4)
axes[0].axvline(sal.median(), color='#E8722A', linestyle='--', linewidth=2,
                label=f'Median ${sal.median():,.0f}')
axes[0].axvline(sal.mean(),   color='green',   linestyle=':',  linewidth=2,
                label=f'Mean   ${sal.mean():,.0f}')
axes[0].xaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'${x/1000:.0f}k'))
axes[0].set_title('Salary Distribution (Linear Scale)')
axes[0].set_xlabel('Annual Salary (USD)')
axes[0].set_ylabel('Number of Respondents')
axes[0].legend(fontsize=10)

# Right: log scale -- compresses the long right tail so the bulk of the
# distribution becomes visible
# Log-spaced bin edges give bars of equal width on the log axis; starting at
# sal.min() keeps the lowest salaries inside the first bin
log_bins = np.logspace(np.log10(sal.min()), np.log10(sal.max()), 60)
axes[1].hist(sal, bins=log_bins, color='#2E75B6', edgecolor='white', linewidth=0.4)
axes[1].set_xscale('log')
axes[1].xaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'${x:,.0f}'))
axes[1].set_title('Salary Distribution (Log Scale)')
axes[1].set_xlabel('Annual Salary (USD, log scale)')
axes[1].set_ylabel('Number of Respondents')

fig.suptitle('SO 2025: Developer Salary Distribution', fontsize=15, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print('On the log scale the distribution looks far more symmetric than on the linear scale.')
print('Many modelling workflows log-transform salary before training for this reason.')
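
To see numerically why a log transform helps with right-skewed data, here is a minimal sketch on synthetic values. The log-normal parameters and the `skewness` helper are assumptions made for illustration; they are not derived from the survey.

```python
import numpy as np

# Synthetic right-skewed "salaries" (log-normal); parameters are illustrative
rng = np.random.default_rng(0)
salaries = rng.lognormal(mean=11.0, sigma=0.6, size=5000)

def skewness(values):
    """Sample skewness: mean cubed z-score (about 0 for a symmetric distribution)."""
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))

skew_raw = skewness(salaries)
skew_log = skewness(np.log(salaries))

print(f'Skewness of raw salaries: {skew_raw:.2f}')
print(f'Skewness after log:       {skew_log:.2f}')
```

A strongly positive skewness for the raw values and a value near zero after the transform is the numeric counterpart of what the two histograms show.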


In [None]:
# 4.1.2 -- Bar chart: top programming languages

if 'LanguageHaveWorkedWith' in df.columns:
    lang_counts = (
        df['LanguageHaveWorkedWith']
        .dropna()
        .str.split(';')
        .explode()
        .str.strip()
        .value_counts()
        .head(15)
    )

    fig, ax = plt.subplots(figsize=(11, 6))

    # Colour Python distinctively -- it is the subject of this book
    colors = ['#E8722A' if lang == 'Python' else '#2E75B6'
              for lang in lang_counts.index[::-1]]

    bars = ax.barh(lang_counts.index[::-1], lang_counts.values[::-1],
                   color=colors, edgecolor='white', linewidth=0.5)

    # Add value labels at the end of each bar
    for bar, val in zip(bars, lang_counts.values[::-1]):
        pct = val / len(df) * 100
        ax.text(val + 30, bar.get_y() + bar.get_height()/2,
                f'{val:,}  ({pct:.1f}%)', va='center', ha='left', fontsize=9)

    ax.set_xlabel('Number of Respondents')
    ax.set_title('SO 2025: Top 15 Programming Languages\n(% of survey respondents who used each language)')
    ax.set_xlim(0, lang_counts.max() * 1.25)   # extra room for labels
    plt.tight_layout()
    plt.show()


In [None]:
# 4.1.3 -- Scatter plot: experience vs salary

if 'YearsCodePro' in df.columns:
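    # Sample for readability -- plotting 15k overlapping points is hard to read
    # Cap n at the rows that survive dropna(), otherwise sample() can raise a ValueError
    plot_df = df[['YearsCodePro', 'ConvertedCompYearly', 'uses_python']].dropna()
    sample = plot_df.sample(n=min(2000, len(plot_df)), random_state=42)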

    fig, ax = plt.subplots(figsize=(11, 6))

    # Plot non-Python and Python users separately with different colours
    for uses_py, colour, label in [(False, '#AAAAAA', 'Other languages'),
                                    (True,  '#E8722A', 'Python user')]:
        mask = sample['uses_python'] == uses_py
        ax.scatter(
            sample.loc[mask, 'YearsCodePro'],
            sample.loc[mask, 'ConvertedCompYearly'],
            c=colour, alpha=0.4, s=18, label=label, linewidths=0
        )

    # Add a trend line using numpy polynomial fit
    valid = sample.dropna()
    z = np.polyfit(valid['YearsCodePro'], valid['ConvertedCompYearly'], deg=1)
    p = np.poly1d(z)
    x_line = np.linspace(valid['YearsCodePro'].min(), valid['YearsCodePro'].max(), 100)
    ax.plot(x_line, p(x_line), color='#1B3A5C', linewidth=2.5,
            label=f'Trend (slope: ${z[0]:,.0f}/yr)')

    ax.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'${x/1000:.0f}k'))
    ax.set_xlabel('Years of Professional Coding Experience')
    ax.set_ylabel('Annual Salary (USD)')
    ax.set_title(f'SO 2025: Experience vs Salary\n({len(sample):,}-respondent sample)')
    ax.legend(fontsize=10)
    plt.tight_layout()
    plt.show()

    corr = valid['YearsCodePro'].corr(valid['ConvertedCompYearly'])
    print(f'Pearson correlation (experience vs salary): {corr:.3f}')
    print(f'Linear trend: each additional year of experience -> +${z[0]:,.0f} salary')
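
The trend line above relies on `np.polyfit` with `deg=1`. As a small sanity check of how that call behaves, the sketch below fits toy data that lies exactly on a known line; the numbers are invented for illustration.

```python
import numpy as np

# Toy data lying exactly on salary = 50,000 + 10,000 * years (illustrative numbers)
years = np.array([0, 1, 2, 3, 4], dtype=float)
salary = 50_000 + 10_000 * years

# deg=1 fits a straight line; coefficients come back highest power first
slope, intercept = np.polyfit(years, salary, deg=1)
trend = np.poly1d([slope, intercept])      # callable: trend(x) = slope * x + intercept

print(f'slope = {slope:,.0f} per year, intercept = {intercept:,.0f}')
print(f'predicted salary at 6 years: {trend(6):,.0f}')
```

On real, noisy data the fitted slope is a least-squares estimate of the average change per year, not a guarantee for any individual respondent.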


In [None]:
# 4.1.4 -- Subplots grid: salary by key categorical variables

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Plot 1: respondents by salary band
if 'salary_band' in df.columns:
    band_counts = df['salary_band'].value_counts().sort_index()
    axes[0].bar(band_counts.index.astype(str), band_counts.values,
                color='#2E75B6', edgecolor='white')
    axes[0].set_title('Respondents by Salary Band')
    axes[0].set_xlabel('Salary Band')
    axes[0].set_ylabel('Count')
    for i, v in enumerate(band_counts.values):
        axes[0].text(i, v + 20, f'{v:,}', ha='center', fontsize=9)

# Plot 2: respondents by remote work status
if 'RemoteWork' in df.columns:
    rw = df['RemoteWork'].value_counts().head(4)
    axes[1].bar(rw.index, rw.values, color='#1F7A8C', edgecolor='white')
    axes[1].set_title('Respondents by Remote Work Status')
    axes[1].set_xlabel('Remote Work')
    axes[1].set_ylabel('Count')
    axes[1].tick_params(axis='x', rotation=20)

# Plot 3: Python vs non-Python salary comparison
if 'uses_python' in df.columns:
    py_sal    = df[df['uses_python']]['ConvertedCompYearly']
    nopy_sal  = df[~df['uses_python']]['ConvertedCompYearly']
    labels    = ['Python users', 'Non-Python']
    medians   = [py_sal.median(), nopy_sal.median()]
    counts    = [len(py_sal), len(nopy_sal)]
    bars = axes[2].bar(labels, medians, color=['#E8722A', '#2E75B6'], edgecolor='white')
    axes[2].yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'${x/1000:.0f}k'))
    axes[2].set_title('Median Salary: Python vs Non-Python')
    axes[2].set_ylabel('Median Salary (USD)')
    for bar, med, n in zip(bars, medians, counts):
        axes[2].text(bar.get_x() + bar.get_width()/2, med + 1000,
                     f'${med/1000:.0f}k\nn={n:,}', ha='center', fontsize=10)

fig.suptitle('SO 2025: Salary by Key Variables', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()


---

## Section 4.2 -- Seaborn: Statistical Visualisation

Seaborn wraps Matplotlib with a higher-level API designed for statistical plots.
It integrates natively with Pandas DataFrames -- you pass column names as strings
rather than extracting arrays manually.

Seaborn's key advantage: it shows distributions and relationships, not just values.
A bar chart shows one number per group. A box plot shows the full distribution.
That difference matters enormously for understanding data before modelling.
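
A minimal sketch of that difference, using two made-up groups (not survey data) that share a median but differ widely in spread:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Two invented groups with the same median but very different spread
toy = pd.DataFrame({
    'group': ['A'] * 5 + ['B'] * 5,
    'value': [48, 49, 50, 51, 52,      # A: tightly clustered
              10, 30, 50, 70, 90],     # B: widely spread
})

medians = toy.groupby('group')['value'].median()
print(medians)                          # both groups share the same median

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
axes[0].bar(medians.index, medians.values)                   # one number per group
axes[0].set_title('Bar chart of medians')
sns.boxplot(data=toy, x='group', y='value', ax=axes[1])     # full spread per group
axes[1].set_title('Box plot of the same data')
plt.tight_layout()
plt.show()
```

The two bars come out the same height, while the box plot makes the difference in spread between the groups immediately visible.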


In [None]:
# 4.2.1 -- Distribution plots: histplot and kdeplot

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# histplot with KDE overlay
sns.histplot(
    data=df, x='ConvertedCompYearly',
    bins=50, kde=True,          # kde=True adds the smooth density curve
    color='#2E75B6', ax=axes[0]
)
axes[0].xaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'${x/1000:.0f}k'))
axes[0].set_title('Salary Distribution with KDE')
axes[0].set_xlabel('Annual Salary (USD)')

# kdeplot comparing Python vs non-Python salary distributions
if 'uses_python' in df.columns:
    for uses_py, label, colour in [
        (True,  'Python users',  '#E8722A'),
        (False, 'Non-Python',    '#2E75B6')
    ]:
        subset = df[df['uses_python'] == uses_py]['ConvertedCompYearly']
        sns.kdeplot(subset, label=label, color=colour, linewidth=2.5, ax=axes[1])
    axes[1].xaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'${x/1000:.0f}k'))
    axes[1].set_title('Salary KDE: Python vs Non-Python')
    axes[1].set_xlabel('Annual Salary (USD)')
    axes[1].legend()

plt.tight_layout()
plt.show()


In [None]:
# 4.2.2 -- Box plots and violin plots: understanding what they show
#
# Box plot anatomy:
#   - Box:     IQR (25th to 75th percentile) -- the middle 50% of the data
#   - Line:    Median (50th percentile)
#   - Whiskers: extend to the most extreme data point within 1.5 * IQR of the box edges
#   - Points:  Outliers beyond the whiskers
#
# Violin plot: mirrors a KDE density curve on both sides -- shows the full distribution shape

top_countries = df['Country'].value_counts().head(6).index.tolist()
df_top = df[df['Country'].isin(top_countries)].copy()

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Box plot
order = (
    df_top.groupby('Country')['ConvertedCompYearly']
    .median().sort_values(ascending=False).index
)
sns.boxplot(
    data=df_top, x='Country', y='ConvertedCompYearly',
    order=order, palette='deep', ax=axes[0]
)
axes[0].yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'${x/1000:.0f}k'))
axes[0].set_title('Salary Distribution by Country (Box Plot)\nBox=IQR, Line=Median, Points=Outliers')
axes[0].set_xlabel('')
axes[0].tick_params(axis='x', rotation=25)

# Violin plot
sns.violinplot(
    data=df_top, x='Country', y='ConvertedCompYearly',
    order=order, palette='deep', inner='quartile', ax=axes[1]
    # inner='quartile' draws quartile lines inside the violin
)
axes[1].yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'${x/1000:.0f}k'))
axes[1].set_title('Salary Distribution by Country (Violin Plot)\nWidth = density of respondents at each salary level')
axes[1].set_xlabel('')
axes[1].tick_params(axis='x', rotation=25)

plt.tight_layout()
plt.show()

print('A violin plot can reveal multimodal shapes (e.g. two salary clusters) that a box plot hides.')
print('Wide sections have many respondents; narrow sections have few.')
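
The box plot anatomy described in the comments above can be reproduced by hand. The sketch below applies the 1.5 * IQR rule to a small made-up array; the variable names are chosen for illustration.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])   # made-up values with one extreme point

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Fences: the furthest a whisker is allowed to reach
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points that are still inside the fences
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(f'Box: {q1} to {q3} (IQR = {iqr}), median = {median}')
print(f'Whiskers: {whisker_low} to {whisker_high}')
print(f'Outliers: {outliers.tolist()}')
```

Here the upper fence is 14.5, but the upper whisker stops at 9 because that is the largest value inside the fence; the value 30 is drawn as an individual outlier point.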


In [None]:
# 4.2.3 -- Categorical plot: salary by education level

if 'EdLevel' in df.columns:
    # Shorten long education level strings for readable axis labels
    ed_map = {
        "Bachelor's degree (B.A., B.S., B.Eng., etc.)": "Bachelor's",
        "Master's degree (M.A., M.S., M.Eng., MBA, etc.)": "Master's",
        'Some college/university study without earning a degree': 'Some college',
        'Associate degree (A.A., A.S., etc.)': 'Associate',
        'Primary/elementary school': 'Primary',
        'Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)': 'Secondary',
        'Other doctoral degree (Ph.D., Ed.D., etc.)': 'PhD',
        'Professional degree (JD, MD, Ph.D, Ed.D, etc.)': 'Prof. degree',
        'Something else': 'Other',
    }
    df['ed_short'] = df['EdLevel'].replace(ed_map)

    ed_order = (
        df.groupby('ed_short')['ConvertedCompYearly']
        .median().sort_values(ascending=False).index
    )

    fig, ax = plt.subplots(figsize=(12, 6))
    sns.boxplot(
        data=df, x='ed_short', y='ConvertedCompYearly',
        order=ed_order, palette='Blues_d', ax=ax
    )
    ax.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'${x/1000:.0f}k'))
    ax.set_title('SO 2025: Salary by Education Level\n(sorted by median salary)')
    ax.set_xlabel('')
    ax.set_ylabel('Annual Salary (USD)')
    ax.tick_params(axis='x', rotation=30)
    plt.tight_layout()
    plt.show()


In [None]:
# 4.2.4 -- Heatmap: correlation matrix
#
# A heatmap displays a matrix of values as colours.
# For a correlation matrix: values range from -1 to +1.
# +1 = perfect positive correlation (both go up together)
# -1 = perfect negative correlation (one goes up, other goes down)
#  0 = no linear relationship

numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
# Keep only columns with enough non-null values and meaningful variance
numeric_cols = [c for c in numeric_cols
                if df[c].notna().sum() > len(df) * 0.5
                and df[c].std() > 0][:8]   # limit to 8 for readability

corr_matrix = df[numeric_cols].corr()

fig, ax = plt.subplots(figsize=(10, 8))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))  # hide the upper triangle and the diagonal (redundant)
sns.heatmap(
    corr_matrix,
    mask=mask,
    annot=True,          # show numeric values in each cell
    fmt='.2f',           # 2 decimal places
    cmap='RdBu_r',       # red=positive, blue=negative, white=zero
    center=0,            # centre the colour scale at 0
    square=True,
    linewidths=0.5,
    cbar_kws={'shrink': 0.8},
    ax=ax
)
ax.set_title('SO 2025: Correlation Matrix of Numeric Features\n(lower triangle only -- upper is identical)')
plt.tight_layout()
plt.show()

# Find the strongest correlations with salary
if 'ConvertedCompYearly' in corr_matrix.columns:
    sal_corr = corr_matrix['ConvertedCompYearly'].drop('ConvertedCompYearly').sort_values(key=abs, ascending=False)
    print('Correlations with salary (strongest first):')
    for feat, corr in sal_corr.items():
        direction = 'positive' if corr > 0 else 'negative'
        print(f'  {feat:<25} {corr:>+.3f}  ({direction})')
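
The three reference points for a correlation coefficient (+1, -1, and 0) can be checked on toy data. This sketch uses invented arrays and also shows that a coefficient near zero only rules out a *linear* relationship.

```python
import numpy as np

x = np.arange(-5, 6, dtype=float)            # -5, -4, ..., 5

r_up = np.corrcoef(x, 2 * x + 1)[0, 1]       # straight line, rising
r_down = np.corrcoef(x, -3 * x)[0, 1]        # straight line, falling
r_curve = np.corrcoef(x, x ** 2)[0, 1]       # perfect U-shape, but not linear

print(f'rising line : {r_up:.2f}')
print(f'falling line: {r_down:.2f}')
print(f'U-shape     : {abs(r_curve):.2f}')
```

The U-shaped case is fully determined by `x`, yet its Pearson correlation is essentially zero, which is why a heatmap should be read alongside scatter plots rather than instead of them.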


---

## Section 4.3 -- Advanced Visualisation: Pair Plots, FacetGrid, and Plotly

This section covers three tools for exploring multi-dimensional relationships:

- **Pair plots** show all pairwise scatter plots and distributions in one grid
- **FacetGrid** repeats the same chart across subgroups for direct comparison
- **Plotly Express** creates interactive charts that users can zoom, hover over, and filter


In [None]:
# 4.3.1 -- Pair plot: pairwise relationships
#
# A pair plot (scatter plot matrix) shows every variable plotted against
# every other variable. The diagonal shows each variable's own distribution.
# Use it for an instant overview of which variables are correlated.

pair_cols = ['ConvertedCompYearly', 'YearsCodePro']
pair_cols = [c for c in pair_cols if c in df.columns]

if len(pair_cols) >= 2 and 'uses_python' in df.columns:
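    # Cap n at the rows that survive dropna(), otherwise sample() can raise a ValueError
    pair_df = df[pair_cols + ['uses_python']].dropna()
    sample = pair_df.sample(n=min(800, len(pair_df)), random_state=42)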
    sample['Language Group'] = sample['uses_python'].map({True: 'Python', False: 'Other'})

    g = sns.pairplot(
        sample[pair_cols + ['Language Group']],
        hue='Language Group',
        palette={'Python': '#E8722A', 'Other': '#2E75B6'},
        plot_kws={'alpha': 0.4, 's': 15},
        diag_kind='kde'
    )
    g.figure.suptitle('SO 2025: Pair Plot -- Salary and Experience', y=1.02, fontsize=13)
    plt.show()


In [None]:
# 4.3.2 -- FacetGrid: same chart across multiple subgroups
#
# FacetGrid creates a grid of axes -- each cell shows the same plot
# for a different subset of the data. Perfect for comparing distributions
# across categories without manually writing a loop.

if 'RemoteWork' in df.columns:
    top_remote = df['RemoteWork'].value_counts().head(3).index.tolist()
    df_remote  = df[df['RemoteWork'].isin(top_remote)].copy()

    g = sns.FacetGrid(
        df_remote,
        col='RemoteWork',       # one column per remote work category
        col_wrap=3,
        height=4, aspect=1.2,
        sharey=False
    )
    g.map_dataframe(
        sns.histplot,
        x='ConvertedCompYearly', bins=40, color='#2E75B6', kde=True
    )
    for ax in g.axes.flat:
        ax.xaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'${x/1000:.0f}k'))
    g.set_titles(col_template='{col_name}')
    g.set_axis_labels('Annual Salary (USD)', 'Count')
    g.figure.suptitle('SO 2025: Salary Distribution by Remote Work Status',
                      y=1.04, fontsize=13, fontweight='bold')
    plt.tight_layout()
    plt.show()


In [None]:
# 4.3.3 -- Plotly Express: interactive charts
#
# Plotly creates interactive HTML charts -- hover to see values,
# click legend items to show/hide groups, zoom by dragging.
# In Colab these render inline. Use for exploratory work and presentations.

top5 = df['Country'].value_counts().head(5).index.tolist()
df_p5 = df[df['Country'].isin(top5)].copy()

# Interactive box plot
fig = px.box(
    df_p5,
    x='Country', y='ConvertedCompYearly',
    color='Country',
    title='SO 2025: Interactive Salary Distribution by Country<br><sup>Hover for exact values, click legend to hide/show</sup>',
    labels={'ConvertedCompYearly': 'Annual Salary (USD)', 'Country': ''},
    category_orders={'Country': list(
        df_p5.groupby('Country')['ConvertedCompYearly']
        .median().sort_values(ascending=False).index
    )}
)
fig.update_layout(height=500, showlegend=False)
fig.update_yaxes(tickformat='$,.0f')
fig.show()


In [None]:
# 4.3.4 -- Plotly scatter: experience vs salary, interactive

if 'YearsCodePro' in df.columns and 'uses_python' in df.columns:
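    # Cap n at the rows that survive dropna(), otherwise sample() can raise a ValueError
    scatter_df = df[['YearsCodePro', 'ConvertedCompYearly', 'uses_python',
                     'Country', 'salary_band']].dropna()
    sample = scatter_df.sample(n=min(1500, len(scatter_df)), random_state=42)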
    sample['Language Group'] = sample['uses_python'].map({True: 'Python', False: 'Other'})
    sample['salary_band'] = sample['salary_band'].astype(str)

    fig = px.scatter(
        sample,
        x='YearsCodePro',
        y='ConvertedCompYearly',
        color='Language Group',
        hover_data=['Country', 'salary_band'],
        color_discrete_map={'Python': '#E8722A', 'Other': '#2E75B6'},
        title='SO 2025: Experience vs Salary (Interactive)<br><sup>Hover points for details</sup>',
        labels={
            'YearsCodePro': 'Years of Professional Coding',
            'ConvertedCompYearly': 'Annual Salary (USD)'
        },
        opacity=0.5
    )
    fig.update_layout(height=520)
    fig.update_yaxes(tickformat='$,.0f')
    fig.show()


In [None]:
# 4.3.5 -- AI tool adoption heatmap

if 'AIToolCurrently' in df.columns and 'Country' in df.columns:
    top8_countries = df['Country'].value_counts().head(8).index.tolist()

    # Get top AI tools
    all_tools = (
        df['AIToolCurrently'].dropna()
        .str.split(';').explode().str.strip()
        .value_counts().head(6).index.tolist()
    )

    # Build adoption matrix: rows=countries, cols=tools, values=% adoption
    rows = []
    for country in top8_countries:
        country_df = df[df['Country'] == country]
        row = {'Country': country}
        for tool in all_tools:
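            # regex=False matches the tool name literally, even if it contains characters such as '+' or '('
            uses_tool = country_df['AIToolCurrently'].str.contains(tool, na=False, regex=False)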
            row[tool] = uses_tool.mean() * 100   # % of respondents in that country
        rows.append(row)

    heat_df = pd.DataFrame(rows).set_index('Country')

    fig, ax = plt.subplots(figsize=(12, 6))
    sns.heatmap(
        heat_df,
        annot=True, fmt='.0f',
        cmap='YlOrRd',
        linewidths=0.5,
        cbar_kws={'label': '% of respondents using tool'},
        ax=ax
    )
    ax.set_title('SO 2025: AI Tool Adoption by Country (%)',
                 fontsize=13, fontweight='bold')
    ax.set_xlabel('AI Tool')
    ax.set_ylabel('Country')
    plt.tight_layout()
    plt.show()


In [None]:
# 4.3.6 -- Full EDA summary: salary by developer type

if 'DevType' in df.columns:
    # DevType is multi-value -- take the first role listed per respondent
    df['primary_role'] = df['DevType'].str.split(';').str[0].str.strip()

    top_roles = df['primary_role'].value_counts().head(8).index.tolist()
    df_roles  = df[df['primary_role'].isin(top_roles)]

    role_order = (
        df_roles.groupby('primary_role')['ConvertedCompYearly']
        .median().sort_values(ascending=False).index
    )

    fig, ax = plt.subplots(figsize=(13, 6))
    sns.boxplot(
        data=df_roles,
        x='primary_role', y='ConvertedCompYearly',
        order=role_order,
        palette='deep', ax=ax
    )
    ax.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'${x/1000:.0f}k'))
    ax.set_title('SO 2025: Salary Distribution by Developer Type\n(sorted by median; box=IQR, line=median)',
                 fontsize=13)
    ax.set_xlabel('')
    ax.set_ylabel('Annual Salary (USD)')
    ax.tick_params(axis='x', rotation=30)

    # Annotate with median value
    for i, role in enumerate(role_order):
        med = df_roles[df_roles['primary_role'] == role]['ConvertedCompYearly'].median()
        ax.text(i, med + 3000, f'${med/1000:.0f}k', ha='center', fontsize=9, fontweight='bold')

    plt.tight_layout()
    plt.show()


---

## Chapter 4 Summary

Chapter 4 completes Part 2. You now have the full EDA picture of the 2025 developer landscape
and the visualisation toolkit to explore any dataset effectively.

### Key Takeaways

- The **Figure/Axes** mental model: `fig, ax = plt.subplots()` is the pattern for everything.
  Customise via `ax.set_title()`, `ax.set_xlabel()`, `ax.xaxis.set_major_formatter()`.
- **Log scale** is often essential for salary and revenue data -- it reveals the true
  distribution shape hidden by right skew.
- **Box plots** show IQR, median, and outliers. **Violin plots** also show density.
  Choose violin when distribution shape matters; box when you want clean comparisons.
- **Correlation heatmaps** should mask the upper triangle (redundant) and use a
  diverging colour scale centred at zero.
- **Pair plots** are powerful for initial exploration but slow with > 5 variables
  or > 2,000 rows -- always sample first.
- **FacetGrid** is the clean way to compare distributions across categories.
- **Plotly Express** adds interactivity with almost no extra code -- `px.scatter`,
  `px.box`, and `px.histogram` are interactive counterparts to the static charts,
  though their arguments differ from Matplotlib's.
- **Chart type selection rule:** histogram for one variable distribution,
  scatter for two continuous variables, box/violin for one continuous + one categorical,
  heatmap for a matrix of values, bar for counts or simple comparisons.
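
The chart selection rule can be written down as a small lookup. `suggest_chart` below is a hypothetical helper, not part of any library, and treating two categorical variables as a bar chart of counts is an assumption that extends the rule above.

```python
def suggest_chart(x_kind, y_kind=None):
    """Return a chart type for the given variable kinds.

    Hypothetical helper: kinds are 'continuous', 'categorical', or 'matrix'.
    """
    if x_kind == 'matrix':
        return 'heatmap'
    if y_kind is None:
        # One variable: distribution if continuous, counts if categorical
        return 'histogram' if x_kind == 'continuous' else 'bar'
    kinds = {x_kind, y_kind}
    if kinds == {'continuous'}:
        return 'scatter'
    if kinds == {'continuous', 'categorical'}:
        return 'box/violin'
    return 'bar'   # assumption: two categorical variables -> compare counts


print(suggest_chart('continuous'))                  # histogram
print(suggest_chart('continuous', 'continuous'))    # scatter
print(suggest_chart('categorical', 'continuous'))   # box/violin
print(suggest_chart('matrix'))                      # heatmap
```

Treat the result as a starting point: the question being asked of the data should still drive the final choice.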

### Project Thread Status

| Visualisation | Status |
|---------------|--------|
| Salary histogram (linear + log scale) | Done |
| Top-15 language bar chart | Done |
| Experience vs salary scatter with trend line | Done |
| Salary by country -- box and violin plots | Done |
| Salary by education level | Done |
| Numeric correlation heatmap | Done |
| AI tool adoption heatmap by country | Done |
| Interactive Plotly salary box and scatter | Done |
| Salary by developer type (annotated box plot) | Done |

---

### What's Next: Chapter 5 -- SciPy and Statistical Computing

Chapter 5 moves from visualisation to formal statistical analysis.
We run hypothesis tests on the SO 2025 data -- is the salary difference between
Python and non-Python developers statistically significant? Does education level
have a measurable effect on salary? These questions set up the feature engineering
decisions in Chapter 6.

---

*End of Chapter 4 -- Python for AI/ML*  
