- Overview
- Part 01: Basic Visualizations
- Part 02: Geographic Visualizations
- Part 03: Statistical Visualizations
- Part 04: 3D Visualizations
- Part 05: Missing Data Visualization
- Comparison Table
- Best Practices
- Quick Reference
This documentation provides a comprehensive guide to five enhanced Jupyter notebooks designed for machine learning and data science visualization. Each notebook progressively builds upon visualization concepts, from basic plots to advanced 3D visualizations and data quality assessment.
These notebooks serve as:
- Educational Resource: Step-by-step tutorials for beginners
- Reference Guide: Quick lookup for visualization techniques
- Best Practices: Production-ready code examples
- Portfolio Projects: Demonstrable data science skills
| Requirement | Version | Purpose |
|---|---|---|
| Python | 3.7+ | Core language |
| Pandas | 1.0+ | Data manipulation |
| NumPy | 1.18+ | Numerical operations |
| Matplotlib | 3.1+ | Static visualizations |
| Seaborn | 0.10+ | Statistical plots |
| Plotly | 4.0+ | Interactive visualizations |
graph TD
A[Data Loading] --> B[Data Exploration]
B --> C[Data Preprocessing]
C --> D{Visualization Type}
D -->|Basic| E[Part 01: Scatter, Bar, Line]
D -->|Geographic| F[Part 02: Maps, Choropleth]
D -->|Statistical| G[Part 03: Distributions, Correlations]
D -->|Advanced| H[Part 04: 3D, Multi-dimensional]
D -->|Quality| I[Part 05: Missing Data, Binning]
E --> J[Insights & Interpretation]
F --> J
G --> J
H --> J
I --> J
File: machine-learning-visualization-part-1.ipynb
File on Kaggle: Kaggle link
File on GitHub: GitHub link
Focus: Fundamental visualization techniques using Matplotlib, Seaborn, and Plotly.
Code Cells: 80 | Markdown Cells: 18
- Load and explore datasets
- Create basic scatter plots
- Build bar charts and categorical visualizations
- Understand marginal distributions
- Master plot customization
flowchart LR
A[Load Dataset] --> B[Data Exploration]
B --> C[Scatter Plots]
C --> D[Bar Charts]
D --> E[Line Plots]
E --> F[Marginal Distributions]
F --> G[Customization]
G --> H[Export & Share]
| Feature | Description | Library | Complexity |
|---|---|---|---|
| Scatter Plots | Relationship between 2 variables | Matplotlib/Plotly | ⭐ Basic |
| Bar Charts | Categorical comparisons | Matplotlib/Seaborn | ⭐ Basic |
| Line Plots | Trends over time/sequence | Matplotlib | ⭐ Basic |
| Histograms | Distribution visualization | Seaborn | ⭐⭐ Intermediate |
| Box Plots | Statistical summaries | Seaborn | ⭐⭐ Intermediate |
| Marginal Plots | Combined distributions | Plotly | ⭐⭐⭐ Advanced |
# Standard data loading pattern
import pandas as pd
import numpy as np
# Load dataset
df = pd.read_csv('dataset.csv')
# Initial exploration
df.head()
df.info()
df.describe()
Purpose: Understand data structure, types, and basic statistics.
Techniques Covered:
- Basic scatter plot
- Colored by category
- Sized by variable
- With trend lines
- Interactive with Plotly
When to Use:
- Exploring relationships between two continuous variables
- Identifying correlations
- Detecting outliers
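The variations above can be combined in one figure; a minimal sketch on synthetic data (the column names x, y, and group are hypothetical) showing category coloring, value-based sizing, and a fitted trend line:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# Synthetic demo data; swap in your own DataFrame
rng = np.random.default_rng(42)
df = pd.DataFrame({'x': rng.normal(size=100),
                   'group': rng.choice(['A', 'B'], size=100)})
df['y'] = 2 * df['x'] + rng.normal(scale=0.5, size=100)
fig, ax = plt.subplots(figsize=(8, 5))
for name, sub in df.groupby('group'):  # colored by category
    ax.scatter(sub['x'], sub['y'], label=name,
               s=40 + 20 * sub['x'].abs(),  # sized by a variable
               alpha=0.6)
# Least-squares trend line over all points
slope, intercept = np.polyfit(df['x'], df['y'], deg=1)
xs = np.linspace(df['x'].min(), df['x'].max(), 100)
ax.plot(xs, slope * xs + intercept, 'k--', label='trend')
ax.legend()
plt.show()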
Variations:
- Vertical/horizontal bars
- Grouped bars
- Stacked bars
- Percentage bars
Best For:
- Comparing categories
- Showing rankings
- Displaying distributions across groups
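A minimal sketch of the grouped, stacked, and percentage variations, all built from one small frame of hypothetical counts:
import matplotlib.pyplot as plt
import pandas as pd
# Hypothetical counts per category and group
counts = pd.DataFrame({'A': [10, 15, 7], 'B': [12, 9, 14]},
                      index=['cat1', 'cat2', 'cat3'])
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
counts.plot(kind='bar', ax=axes[0], title='Grouped bars')
counts.plot(kind='bar', stacked=True, ax=axes[1], title='Stacked bars')
# Percentage bars: normalize each row so the groups sum to 1
counts.div(counts.sum(axis=1), axis=0).plot(kind='bar', stacked=True,
                                            ax=axes[2], title='Percentage bars')
plt.tight_layout()
plt.show()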
Combines:
- Central scatter plot
- Marginal histograms/box plots on axes
- Statistical overlays
Value: Shows both individual variable distributions AND their relationship.
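A minimal marginal-plot sketch with Plotly Express, using its built-in tips dataset so it runs as-is; marginal_x and marginal_y choose what is drawn on each margin:
import plotly.express as px
tips = px.data.tips()  # built-in sample data
fig = px.scatter(tips, x='total_bill', y='tip', color='sex',
                 marginal_x='histogram',  # x distribution on top
                 marginal_y='box')        # y distribution on the right
fig.show()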
| Aspect | Options | Code Example |
|---|---|---|
| Colors | Named, hex, RGB, colormaps | color='red', cmap='viridis' |
| Markers | Shapes and sizes | marker='o', s=100 |
| Labels | Titles, axes, legends | plt.title(), plt.xlabel() |
| Style | Themes and presets | sns.set_style('darkgrid') |
| Layout | Subplots, grids | plt.subplot(2,2,1) |
- Always explore data first: Use .info(), .describe(), .head()
- Handle missing values: Before visualizing
- Choose appropriate plot types: Match visualization to data type
- Label everything: Axes, titles, legends
- Use color purposefully: Not just for aesthetics
- Consider accessibility: Color-blind friendly palettes
import matplotlib.pyplot as plt
import seaborn as sns
# Set style
sns.set_style('whitegrid')
plt.figure(figsize=(10, 6))
# Create visualization
sns.scatterplot(data=df, x='feature1', y='feature2',
hue='category', size='value', alpha=0.6)
# Customize
plt.title('Feature Relationship Analysis', fontsize=16)
plt.xlabel('Feature 1', fontsize=12)
plt.ylabel('Feature 2', fontsize=12)
plt.legend(title='Category', bbox_to_anchor=(1.05, 1))
# Display
plt.tight_layout()
plt.show()
File: machine-learning-visualization-part-2.ipynb
File on Kaggle: Kaggle link
File on GitHub: GitHub link
Focus: Interactive geographic visualizations using Plotly and mapping techniques.
Code Cells: 12 | Markdown Cells: 18
- Create choropleth maps
- Master Plotly Express for interactive plots
- Perform geocoding (location → coordinates)
- Build animated time-series maps
- Visualize spatial distributions
flowchart TD
A[Geographic Data] --> B{Has Coordinates?}
B -->|Yes| C[Direct Mapping]
B -->|No| D[Geocoding]
D --> E[Get Lat/Long]
E --> C
C --> F{Map Type}
F -->|Regions| G[Choropleth Map]
F -->|Points| H[Scatter Geo]
F -->|Density| I[Density Mapbox]
G --> J[Add Interactivity]
H --> J
I --> J
J --> K[Time Animation]
K --> L[Final Map]
| Visualization | Purpose | Interactivity | Best Use Case |
|---|---|---|---|
| Choropleth Map | Color-coded regions | Hover, zoom, pan | Country/state comparisons |
| Scatter Geo | Points on map | Click, hover | City locations, events |
| Density Map | Heat mapping | Zoom, filter | Population density, hotspots |
| Animated Map | Time-series | Play/pause, slider | Data evolution over time |
| Line Map | Routes/connections | Hover paths | Migration, trade routes |
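Before the choropleth walkthrough below, a sketch of the Scatter Geo row from this table, using Plotly's built-in Gapminder sample rather than the notebook's dataset:
import plotly.express as px
world = px.data.gapminder().query('year == 2007')
fig = px.scatter_geo(world,
                     locations='iso_alpha',  # ISO-3 country codes
                     size='pop',             # bubble size by population
                     color='continent',
                     hover_name='country',
                     projection='natural earth',
                     title='Population by Country (2007)')
fig.show()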
Libraries:
- plotly.express: High-level interactive plots
- plotly.graph_objects: Low-level customization
- geocoder: Location-to-coordinates conversion
Dataset: Heart Disease Dataset (with geographic augmentation)
Key Operations:
# Load data
df = pd.read_csv('heart.csv')
# Check structure
print(df.shape)
df.head()
What is Gapminder?
- Historical statistics (GDP, life expectancy, population)
- Multiple countries and years
- Perfect for animated visualizations
Loading:
gapminder = px.data.gapminder()
Purpose: Convert location names to latitude/longitude
Example:
import geocoder
# Get coordinates for a location
g = geocoder.osm('New York City')
lat, lng = g.latlng
Use Cases:
- Customer locations
- Store addresses
- Event venues
Code Pattern:
import plotly.express as px
fig = px.choropleth(
df,
locations='country_code', # ISO country codes
color='value', # Color by this column
hover_name='country', # Show on hover
color_continuous_scale='Viridis',
title='World Data Visualization'
)
fig.show()
Key Parameters:
| Parameter | Description | Example Values |
|---|---|---|
| locations | Geographic identifiers | ISO codes, state names |
| locationmode | Type of location | 'ISO-3', 'USA-states' |
| color | Data for coloring | Any numeric column |
| scope | Map region | 'world', 'usa', 'europe' |
| projection | Map projection | 'natural earth', 'orthographic' |
Creating Time-Series Animations:
fig = px.choropleth(
gapminder,
locations='iso_alpha',
color='lifeExp',
hover_name='country',
animation_frame='year', # Animate by year
animation_group='country',
color_continuous_scale='Plasma',
title='Life Expectancy Over Time'
)
fig.show()
Controls:
- Play/Pause button
- Year slider
- Speed adjustment
fig.update_layout(
geo=dict(
showframe=False,
showcoastlines=True,
projection_type='natural earth'
),
title=dict(
text='Custom Title',
x=0.5,
font=dict(size=20, color='darkblue')
)
)
- Use appropriate projections: Natural Earth for world, Albers for USA
- Choose color scales wisely: Sequential for continuous, categorical for discrete
- Include hover information: Make maps informative
- Test geocoding results: Verify coordinates before plotting
- Optimize for performance: Limit data points for smooth interaction
- Consider map context: Show coastlines, borders as needed
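For the geocoding check in particular, a small hypothetical helper (safe_geocode is not from the notebook) that confirms the lookup succeeded and the coordinates are plausible before plotting:
import geocoder
def safe_geocode(place):
    """Return (lat, lng), or None if the lookup failed."""
    g = geocoder.osm(place)
    if not g.ok:  # geocoder flags failed lookups via .ok
        print(f'Geocoding failed for: {place}')
        return None
    lat, lng = g.latlng
    assert -90 <= lat <= 90 and -180 <= lng <= 180  # basic range check
    return lat, lng
print(safe_geocode('New York City'))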
Multi-Layer Maps:
# Combine choropleth with scatter points
fig = go.Figure()
# Add choropleth layer
fig.add_trace(go.Choropleth(...))
# Add scatter points
fig.add_trace(go.Scattergeo(...))
fig.show()
File: machine-learning-visualization-part-3.ipynb
File on Kaggle: Kaggle link
File on GitHub: GitHub link
Focus: Statistical analysis through visualization using Seaborn.
Code Cells: 5 | Markdown Cells: 6
- Create and interpret joint plots
- Visualize distributions effectively
- Compare distributions across categories
- Build correlation heatmaps
- Use pair plots for multi-variable analysis
flowchart LR
A[Loaded Data] --> B[Univariate Analysis]
B --> C[Distribution Plots]
A --> D[Bivariate Analysis]
D --> E[Joint Plots]
D --> F[Regression Plots]
A --> G[Multivariate Analysis]
G --> H[Pair Plots]
G --> I[Heatmaps]
C --> J[Insights]
E --> J
F --> J
H --> J
I --> J
| Plot Type | Purpose | Shows | Seaborn Function |
|---|---|---|---|
| Joint Plot | Bivariate + distributions | 2 variables + margins | sns.jointplot() |
| Distribution Plot | Data spread | Histogram + KDE | sns.displot() |
| Box Plot | Statistical summary | Quartiles, outliers | sns.boxplot() |
| Violin Plot | Distribution shape | Density + quartiles | sns.violinplot() |
| Heatmap | Matrix visualization | Correlations, patterns | sns.heatmap() |
| Pair Plot | Multiple relationships | All variable pairs | sns.pairplot() |
Core Libraries:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats
Configuration:
# Set style for better aesthetics
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)
Dataset: Heart Disease Dataset
Initial Exploration:
- Shape and structure
- Data types
- Missing values
- Basic statistics
What is a Joint Plot?
A joint plot combines:
- Central plot: Scatter, hexbin, or KDE of two variables
- Marginal plots: Distribution of each variable on the axes
Types:
| Kind | Central Plot | Use Case |
|---|---|---|
| scatter | Scatter plot | Individual data points |
| reg | Scatter + regression | Linear relationships |
| hex | Hexbin density | Large datasets |
| kde | 2D density | Smooth distributions |
| hist | 2D histogram | Binned counts |
Code Example:
# Basic joint plot
sns.jointplot(data=df, x='age', y='chol', kind='scatter')
# With regression
sns.jointplot(data=df, x='age', y='chol', kind='reg',
color='steelblue', height=8)
# KDE joint plot
sns.jointplot(data=df, x='age', y='chol', kind='kde',
fill=True, cmap='Blues')
Understanding Regression Lines:
- Shows linear trend
- Confidence interval (shaded area)
- Pearson correlation coefficient
Interpretation:
# Calculate correlation
from scipy.stats import pearsonr
corr, p_value = pearsonr(df['age'], df['chol'])
print(f'Correlation: {corr:.3f}, P-value: {p_value:.4f}')
Statistical Significance:
- p < 0.05: Significant relationship
- p ≥ 0.05: No significant relationship
Visualizing Single Variables:
# Histogram with KDE
sns.histplot(data=df, x='age', kde=True, bins=30)
# Distribution plot with hue
sns.displot(data=df, x='age', hue='target',
kind='kde', fill=True, alpha=0.5)
Options:
- kde=True: Add kernel density estimate
- hue: Separate by category
- bins: Number of histogram bins
- fill: Fill KDE area
Box Plot Structure:
     Max (or Q3 + 1.5*IQR)
        ──┬──
          │
   Q3  ───┤
          │    ← IQR (Interquartile Range)
   Q2  ───┤    ← Median
          │
   Q1  ───┤
          │
        ──┴──
     Min (or Q1 - 1.5*IQR)

   •  ← Outliers
Code Examples:
# Box plot
sns.boxplot(data=df, x='target', y='age')
# Violin plot (shows distribution shape)
sns.violinplot(data=df, x='target', y='age',
split=True, inner='quartile')
# Grouped comparison
sns.boxplot(data=df, x='cp', y='chol', hue='target')
Purpose: Visualize relationships between all numeric variables
Code Pattern:
# Calculate correlation matrix
corr_matrix = df.corr()
# Create heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix,
annot=True, # Show values
fmt='.2f', # Format to 2 decimals
cmap='coolwarm', # Color scheme
center=0, # Center colormap at 0
square=True, # Square cells
linewidths=1, # Grid lines
cbar_kws={'label': 'Correlation'})
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()
Interpreting Correlations:
| Value Range | Interpretation |
|---|---|
| 0.9 to 1.0 | Very strong positive |
| 0.7 to 0.9 | Strong positive |
| 0.5 to 0.7 | Moderate positive |
| 0.3 to 0.5 | Weak positive |
| -0.3 to 0.3 | Negligible |
| -0.5 to -0.3 | Weak negative |
| -0.7 to -0.5 | Moderate negative |
| -0.9 to -0.7 | Strong negative |
| -1.0 to -0.9 | Very strong negative |
Multi-Variable Exploration:
# Basic pair plot
sns.pairplot(df)
# With categorical coloring
sns.pairplot(df, hue='target', palette='Set2',
diag_kind='kde', # Diagonal plots
plot_kws={'alpha': 0.6})
What It Shows:
- Diagonal: Distribution of each variable
- Off-diagonal: Scatter plots between variable pairs
- Colored by category (with hue)
When to Use:
- Initial data exploration
- Feature selection
- Identifying patterns across multiple variables
- Check assumptions: Linearity, normality for appropriate tests
- Handle outliers: Identify and treat before statistical analysis
- Choose appropriate plots: Match plot to data distribution
- Report statistics: Include correlation coefficients, p-values
- Use appropriate color scales: Diverging for correlations
- Consider sample size: Some plots need sufficient data points
P-Value Interpretation:
- p < 0.001: Highly significant
- p < 0.01: Very significant
- p < 0.05: Significant
- p ≥ 0.05: Not significant
Effect Size:
- Small: |r| < 0.3
- Medium: 0.3 ≤ |r| < 0.5
- Large: |r| ≥ 0.5
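A small hypothetical helper (not from the notebook) that combines the significance and effect-size thresholds above into one report, assuming the heart-disease df loaded earlier:
from scipy.stats import pearsonr
def describe_correlation(x, y):
    """Report correlation strength and significance in one line."""
    r, p = pearsonr(x, y)
    effect = 'large' if abs(r) >= 0.5 else 'medium' if abs(r) >= 0.3 else 'small'
    sig = 'significant' if p < 0.05 else 'not significant'
    return f'r = {r:.3f} ({effect} effect), p = {p:.4f} ({sig})'
print(describe_correlation(df['age'], df['chol']))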
File: machine-learning-visualization-part-4.ipynb
File on Kaggle: Kaggle link
File on GitHub: GitHub link
Focus: Three-dimensional and advanced multi-dimensional visualizations.
Code Cells: 6 | Markdown Cells: 8
- Create 3D scatter and surface plots
- Build interactive 3D visualizations with Plotly
- Visualize multi-dimensional data
- Use dimensionality reduction (PCA) for visualization
- Create bubble charts (4D visualization)
flowchart TD
A[Multi-Dimensional Data] --> B{Dimensions}
B -->|3D| C[Direct 3D Plot]
B -->|>3D| D[Dimensionality Reduction]
D --> E[PCA/t-SNE]
E --> C
C --> F{Plot Type}
F --> G[3D Scatter]
F --> H[3D Surface]
F --> I[3D Line]
G --> J[Add Interactivity]
H --> J
I --> J
J --> K{Library}
K -->|Matplotlib| L[Static 3D]
K -->|Plotly| M[Interactive 3D]
L --> N[Final Visualization]
M --> N
| Visualization | Dimensions | Best For | Library |
|---|---|---|---|
| 3D Scatter | 3-4 (with color/size) | Point distributions | Matplotlib/Plotly |
| 3D Surface | Z = f(X, Y) | Continuous functions | Matplotlib/Plotly |
| 3D Line | Time-series in 3D | Trajectories, paths | Matplotlib |
| Bubble Chart | 4 (x, y, z, size) | Multi-dimensional relationships | Plotly |
| PCA 3D | N → 3 | High-dimensional data | Plotly + sklearn |
Core Libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import plotly.express as px
import plotly.graph_objects as go
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
3D Matplotlib Setup:
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
Dataset: Brain Stroke Dataset
Features for 3D Visualization:
- Age
- BMI (Body Mass Index)
- Average Glucose Level
- (Color/size for additional dimensions)
Matplotlib Example:
fig = plt.figure(figsize=(12, 9))
ax = fig.add_subplot(111, projection='3d')
# Create scatter plot
scatter = ax.scatter(df['age'],
df['bmi'],
df['avg_glucose_level'],
c=df['stroke'], # Color by target
cmap='viridis',
s=50, # Point size
alpha=0.6,
edgecolors='k')
# Labels
ax.set_xlabel('Age', fontsize=12)
ax.set_ylabel('BMI', fontsize=12)
ax.set_zlabel('Glucose Level', fontsize=12)
ax.set_title('3D Patient Data Visualization', fontsize=14)
# Add colorbar
plt.colorbar(scatter, label='Stroke')
plt.show()
Key Parameters:
| Parameter | Description | Example |
|---|---|---|
| projection='3d' | Enable 3D axes | Required for 3D |
| c | Color values | Numeric or categorical |
| cmap | Color map | 'viridis', 'plasma' |
| s | Point size | 50, or array |
| alpha | Transparency | 0.0 to 1.0 |
Why Plotly?
- Interactive rotation
- Zoom and pan
- Hover information
- Better for presentations
Basic Plotly 3D:
fig = px.scatter_3d(df,
x='age',
y='bmi',
z='avg_glucose_level',
color='stroke',
symbol='gender',
size='age',
hover_data=['work_type', 'smoking_status'],
title='Interactive 3D Patient Analysis',
labels={'age': 'Age (years)',
'bmi': 'Body Mass Index',
'avg_glucose_level': 'Glucose Level'})
fig.update_traces(marker=dict(line=dict(width=0.5, color='DarkSlateGrey')))
fig.show()
Plotly Advantages:
- ✅ Interactive controls
- ✅ Hover tooltips
- ✅ Export to HTML
- ✅ Better for web dashboards
Creating Meshgrid Data:
# Generate grid
x = np.linspace(-5, 5, 50)
y = np.linspace(-5, 5, 50)
X, Y = np.meshgrid(x, y)
# Define function
Z = np.sin(np.sqrt(X**2 + Y**2))
Matplotlib Surface:
fig = plt.figure(figsize=(12, 9))
ax = fig.add_subplot(111, projection='3d')
surf = ax.plot_surface(X, Y, Z,
cmap='coolwarm',
edgecolor='none',
alpha=0.8)
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
plt.colorbar(surf)
plt.show()
Plotly Surface:
fig = go.Figure(data=[go.Surface(x=X, y=Y, z=Z,
colorscale='Viridis')])
fig.update_layout(title='3D Surface Plot',
scene=dict(
xaxis_title='X Axis',
yaxis_title='Y Axis',
zaxis_title='Z Axis'),
width=900,
height=700)
fig.show()
Why PCA for Visualization?
- Reduce high-dimensional data to 3D
- Preserve maximum variance
- Visualize complex datasets
PCA Workflow:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Select numeric features
features = df.select_dtypes(include=[np.number]).columns
X = df[features]
# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply PCA
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)
# Create DataFrame
pca_df = pd.DataFrame(data=X_pca,
columns=['PC1', 'PC2', 'PC3'])
pca_df['target'] = df['stroke'].values
# Visualize
fig = px.scatter_3d(pca_df,
x='PC1', y='PC2', z='PC3',
color='target',
title='PCA 3D Visualization')
fig.show()
Explained Variance:
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance explained:",
sum(pca.explained_variance_ratio_))
Adding Fourth Dimension with Size:
fig = px.scatter_3d(df,
x='age',
y='bmi',
z='avg_glucose_level',
color='stroke', # 4th dimension
size='heart_disease', # 5th dimension!
hover_name='id',
title='5D Visualization (x, y, z, color, size)')
fig.show()
Dimension Mapping:
| Dimension | Visual Encoding | Best For |
|---|---|---|
| X-axis | Horizontal position | Primary variable |
| Y-axis | Vertical position | Secondary variable |
| Z-axis | Depth | Tertiary variable |
| Color | Hue | Categorical or continuous |
| Size | Point radius | Magnitude or importance |
| Shape | Marker type | Categories (limited) |
- Limit data points: Too many points obscure patterns
- Use appropriate projections: Orthographic for technical, perspective for natural
- Add interactivity: Rotation enhances understanding
- Choose colors carefully: 3D depth perception affected by color
- Provide multiple views: Show from different angles
- Consider accessibility: Some users struggle with 3D perception
- Standardize data: Before PCA or other dimensionality reduction
- Explain axes: Especially for PCA (variance explained)
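For the multiple-views point above, a sketch that renders the same 3D scatter from several (elevation, azimuth) angles, assuming the Brain Stroke df used earlier in this part:
import matplotlib.pyplot as plt
angles = [(20, 30), (20, 120), (60, 45)]  # (elevation, azimuth) pairs
fig = plt.figure(figsize=(15, 5))
for i, (elev, azim) in enumerate(angles, start=1):
    ax = fig.add_subplot(1, 3, i, projection='3d')
    ax.scatter(df['age'], df['bmi'], df['avg_glucose_level'],
               c=df['stroke'], cmap='viridis', s=20, alpha=0.5)
    ax.view_init(elev=elev, azim=azim)
    ax.set_title(f'elev={elev}, azim={azim}')
plt.tight_layout()
plt.show()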
Camera Position (Plotly):
fig.update_layout(
scene_camera=dict(
eye=dict(x=1.5, y=1.5, z=1.5),
center=dict(x=0, y=0, z=0),
up=dict(x=0, y=0, z=1)
)
)
Viewing Angle (Matplotlib):
ax.view_init(elev=30, azim=45)  # elevation and azimuth angles
Good Use Cases:
- ✅ Truly 3-dimensional data (spatial, physical)
- ✅ Demonstrations and presentations (interactive)
- ✅ Exploratory analysis of multi-dimensional data
- ✅ Showing trajectories or time-series paths
When to Avoid:
- ❌ 2D alternatives are clearer
- ❌ Printed/static reports (hard to interpret)
- ❌ Precise value reading required
- ❌ Large datasets (performance issues)
File: machine-learning-visualization-part-5.ipynb
File on Kaggle: Kaggle link
File on GitHub: GitHub link
Focus: Visualizing and handling missing data, binning, and data preprocessing.
Code Cells: 7 | Markdown Cells: 8
- Visualize missing data patterns
- Assess data quality
- Perform binning and discretization
- Handle missing values appropriately
- Create preprocessed datasets for modeling
flowchart TD
A[Raw Dataset] --> B[Load Data]
B --> C[Check Missing Values]
C --> D{Missing Data?}
D -->|Yes| E[Visualize Patterns]
D -->|No| K[Proceed to Analysis]
E --> F[Missing Matrix]
E --> G[Bar Chart]
E --> H[Heatmap]
E --> I[Dendrogram]
F --> J{Action Required?}
G --> J
H --> J
I --> J
J -->|Drop| L[Remove Rows/Columns]
J -->|Impute| M[Fill Values]
J -->|Keep| K
L --> K
M --> K
K --> N[Binning/Discretization]
N --> O[Final Clean Dataset]
| Visualization | Purpose | Library | Insights Provided |
|---|---|---|---|
| Missing Matrix | Overview of missingness | missingno | Patterns, extent |
| Bar Chart | Missing counts per column | missingno | Which features affected |
| Heatmap | Correlation of missingness | missingno | Related missing patterns |
| Dendrogram | Hierarchical clustering | missingno | Groups of missingness |
Core Libraries:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno  # Specialized for missing data
Installing missingno:
pip install missingno
Dataset: Heart Disease Dataset (with induced missing values for demonstration)
Initial Check:
# Load data
df = pd.read_csv('heart.csv')
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())
# Percentage missing
print("\nPercentage missing:")
print((df.isnull().sum() / len(df)) * 100)
Importance:
- Pattern Detection: Random vs. systematic missingness
- Impact Assessment: How much data is affected
- Relationship Analysis: Which variables have correlated missingness
- Decision Making: Drop, impute, or keep as-is
Types of Missingness:
| Type | Description | Example | Handling |
|---|---|---|---|
| MCAR | Missing Completely At Random | Random survey non-response | Safe to drop |
| MAR | Missing At Random | Income missing for unemployed | Impute conditionally |
| MNAR | Missing Not At Random | High earners hide income | Complex imputation |
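Because this notebook works with induced missing values, it helps to see one way such MCAR-style gaps can be created; a sketch (the notebook's exact procedure may differ):
import numpy as np
rng = np.random.default_rng(42)
mask = rng.random(len(df)) < 0.10  # pick 10% of rows completely at random
df.loc[mask, 'chol'] = np.nan      # blank out 'chol' for those rows
print(f"Induced {mask.sum()} missing values in 'chol'")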
Matrix Visualization:
# Missing data matrix
msno.matrix(df, figsize=(12, 6), fontsize=12)
plt.title('Missing Data Matrix')
plt.show()
Interpretation:
- White lines = missing values
- Black/colored = present values
- Patterns indicate systematic missingness
Bar Chart:
# Missing data bar chart
msno.bar(df, figsize=(12, 6), fontsize=12, color='steelblue')
plt.title('Missing Data Count by Feature')
plt.show()
Shows:
- Absolute count of missing values
- Completeness bar (on right axis)
Heatmap:
# Missing data correlation heatmap
msno.heatmap(df, figsize=(12, 10), fontsize=12)
plt.title('Missing Data Correlation')
plt.show()
Interpretation:
- Values close to 1: Missingness strongly correlated
- Values close to 0: Independent missingness
- Negative values: Inverse relationship
Dendrogram:
# Hierarchical clustering of missingness
msno.dendrogram(df, figsize=(12, 6), fontsize=12)
plt.title('Missing Data Dendrogram')
plt.show()
Use: Identifies groups of features with similar missing patterns
Strategies:
| Method | When to Use | Pros | Cons |
|---|---|---|---|
| Drop Rows | MCAR, <5% missing | Simple, no bias | Data loss |
| Drop Columns | >50% missing, not important | Clean dataset | Feature loss |
| Mean/Median Imputation | MCAR, numeric data | Simple, fast | Reduces variance |
| Mode Imputation | Categorical data | Preserves distribution | May increase mode frequency |
| Forward/Backward Fill | Time series | Maintains trends | Propagates errors |
| Interpolation | Ordered data | Smooth estimates | Assumes continuity |
| Model-Based | MAR, complex patterns | Sophisticated | Computationally expensive |
Code Examples:
# Drop rows with any missing values
df_dropped = df.dropna()
# Drop columns with >50% missing
threshold = int(len(df) * 0.5)  # thresh = minimum non-null count required to keep a column
df_dropped_cols = df.dropna(axis=1, thresh=threshold)
# Mean imputation
df['age'] = df['age'].fillna(df['age'].mean())
# Median imputation (more robust to outliers)
df['chol'] = df['chol'].fillna(df['chol'].median())
# Mode imputation for categorical
df['cp'] = df['cp'].fillna(df['cp'].mode()[0])
# Forward fill (time series)
df = df.ffill()
# Interpolation
df['chol'] = df['chol'].interpolate(method='linear')
Advanced: Multiple Imputation:
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required before importing IterativeImputer)
from sklearn.impute import IterativeImputer
imputer = IterativeImputer(random_state=42)
df_imputed = pd.DataFrame(imputer.fit_transform(df),
                          columns=df.columns)
Purpose: Convert continuous variables to categorical bins
Why Bin Data?
- Simplify models: Reduce continuous complexity
- Handle outliers: Group extreme values
- Create categories: For business rules (e.g., age groups)
- Improve interpretability: Easier to understand
Equal-Width Binning:
# Create bins of equal width
df['age_bin'] = pd.cut(df['age'],
bins=5, # Number of bins
labels=['Very Young', 'Young', 'Middle',
'Senior', 'Elderly'])
# Custom bin edges
df['chol_bin'] = pd.cut(df['chol'],
bins=[0, 200, 240, 300],
labels=['Low', 'Normal', 'High'])
Equal-Frequency Binning (Quantiles):
# Each bin has approximately same number of observations
df['age_qbin'] = pd.qcut(df['age'],
q=4, # Quartiles
labels=['Q1', 'Q2', 'Q3', 'Q4'])
Visualizing Bins:
# Distribution of binned data
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
df['age_bin'].value_counts().plot(kind='bar', color='skyblue')
plt.title('Age Distribution (Equal-Width Bins)')
plt.xlabel('Age Group')
plt.ylabel('Count')
plt.subplot(1, 2, 2)
df['age_qbin'].value_counts().plot(kind='bar', color='lightcoral')
plt.title('Age Distribution (Quantile Bins)')
plt.xlabel('Quartile')
plt.ylabel('Count')
plt.tight_layout()
plt.show()
Multi-Dimensional Visualization:
# Joint plot with categorical hue
sns.jointplot(data=df,
x='age',
y='chol',
hue='target',
kind='kde',
fill=True,
alpha=0.5,
height=10)
plt.suptitle('Age vs. Cholesterol by Heart Disease Status',
y=1.02, fontsize=14)
plt.show()
Benefits:
- Shows distributions for each category
- Identifies class separation
- Useful for feature selection
- Always visualize first: Before handling missing data
- Document decisions: Record why you dropped/imputed
- Check assumptions: Ensure MCAR before simple imputation
- Test sensitivity: See how imputation affects models
- Preserve original data: Keep a copy before modifications
- Consider domain knowledge: Subject matter experts guide imputation
- Bin carefully: Too few bins lose information, too many overfit
- Choose appropriate bin strategy: Equal-width vs. quantile based on use case
- Missing values identified and quantified
- Missingness patterns analyzed
- Appropriate handling strategy selected
- Imputation assumptions validated
- Outliers identified and addressed
- Binning applied where beneficial
- Data types correct
- Ranges validated (no impossible values)
- Duplicates checked
- Final dataset documented
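Several of these checklist items can be verified programmatically; a sketch with hypothetical column names and valid ranges (adapt to your dataset):
assert df.isnull().sum().sum() == 0, 'Unhandled missing values remain'
assert df.duplicated().sum() == 0, 'Duplicate rows present'
assert df['age'].between(0, 120).all(), 'Impossible age values'
print(df.dtypes)  # confirm data types are as expected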
| Feature | Part 01 | Part 02 | Part 03 | Part 04 | Part 05 |
|---|---|---|---|---|---|
| Primary Focus | Basic plots | Geographic | Statistical | 3D/Multi-D | Data quality |
| Code Cells | 80 | 12 | 5 | 6 | 7 |
| Markdown Cells | 18 | 18 | 6 | 8 | 8 |
| Difficulty | ⭐ Beginner | ⭐⭐ Intermediate | ⭐⭐ Intermediate | ⭐⭐⭐ Advanced | ⭐⭐ Intermediate |
| Interactivity | Medium | High | Low | High | Medium |
| Main Library | Matplotlib | Plotly | Seaborn | Plotly/Matplotlib | missingno |
| Dataset Used | Various | Gapminder + Heart | Heart Disease | Brain Stroke | Heart Disease |
| Key Technique | Scatter/Bar | Choropleth maps | Joint plots | 3D scatter | Missing data viz |
| Animation | ❌ | ✅ | ❌ | ❌ | ❌ |
| 3D Support | ❌ | ❌ | ❌ | ✅ | ❌ |
| Statistical Tests | ❌ | ❌ | ✅ | ❌ | ❌ |
| Best For | Learning basics | Location data | Correlations | Complex data | Preprocessing |
| Library | Part 01 | Part 02 | Part 03 | Part 04 | Part 05 |
|---|---|---|---|---|---|
| Pandas | ✅ | ✅ | ✅ | ✅ | ✅ |
| NumPy | ✅ | ❌ | ✅ | ✅ | ✅ |
| Matplotlib | ✅ | ❌ | ✅ | ✅ | ✅ |
| Seaborn | ✅ | ❌ | ✅ | ❌ | ✅ |
| Plotly Express | ✅ | ✅ | ❌ | ✅ | ❌ |
| Plotly Graph Objects | ❌ | ✅ | ❌ | ✅ | ❌ |
| Geocoder | ❌ | ✅ | ❌ | ❌ | ❌ |
| SciPy | ❌ | ❌ | ✅ | ❌ | ❌ |
| Scikit-learn | ❌ | ❌ | ❌ | ✅ | ✅ |
| missingno | ❌ | ❌ | ❌ | ❌ | ✅ |
- Know Your Audience
- Technical vs. non-technical
- Adjust complexity accordingly
- Provide context and interpretation
- Choose the Right Chart Type
  - Comparison → Bar charts
  - Distribution → Histograms, box plots
  - Relationship → Scatter plots
  - Composition → Pie charts, stacked bars
  - Trends → Line charts
  - Geographic → Choropleth, point maps
- Design for Clarity
- Clear titles and labels
- Appropriate color schemes
- Sufficient white space
- Readable font sizes
- Legends when needed
- Color Usage
- Sequential: One variable, ordered (e.g., low to high)
- Diverging: Data with a meaningful center (e.g., correlations)
- Categorical: Distinct categories
- Accessibility: Color-blind friendly palettes
- Storytelling with Data
- Guide the viewer's attention
- Highlight key insights
- Provide context
- Explain unexpected patterns
- Reproducibility
  # Set random seed
  np.random.seed(42)
  # Document versions
  # Python 3.8.10
  # pandas 1.3.0
  # matplotlib 3.4.2
- Modularity
  def create_scatter_plot(df, x, y, hue=None, title=''):
      """
      Create a standardized scatter plot.

      Parameters
      ----------
      df : DataFrame
      x, y : str, column names
      hue : str, optional categorical column
      title : str
      """
      fig, ax = plt.subplots(figsize=(10, 6))
      sns.scatterplot(data=df, x=x, y=y, hue=hue, ax=ax)
      ax.set_title(title, fontsize=14)
      plt.tight_layout()
      return fig, ax
- Error Handling
  try:
      df = pd.read_csv('data.csv')
  except FileNotFoundError:
      print("Error: File not found")
  except pd.errors.EmptyDataError:
      print("Error: File is empty")
- Documentation
- Comment complex operations
- Use docstrings for functions
- Explain non-obvious choices
- Include sources for data/methods
- Large Datasets
- Sample for initial exploration
- Use appropriate data types
- Consider aggregation
- Use hexbin for dense scatter plots
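  A sketch combining the sampling and hexbin suggestions above (assumes a DataFrame df with age and chol columns):
  import matplotlib.pyplot as plt
  sample = df.sample(n=min(10_000, len(df)), random_state=42)  # cap at 10k points
  plt.hexbin(sample['age'], sample['chol'], gridsize=40, cmap='viridis')
  plt.colorbar(label='Count per hexagon')
  plt.show()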
- Interactive Plots
- Limit data points for Plotly (< 10k recommended)
- Use webgl renderer for large datasets
- Disable unused features
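  Plotly Express exposes the WebGL renderer through render_mode; a sketch (df and column names assumed from the earlier parts):
  import plotly.express as px
  fig = px.scatter(df, x='age', y='chol', color='target',
                   render_mode='webgl')  # GPU-backed trace stays responsive on large data
  fig.show()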
- Memory Management
  # Delete unnecessary DataFrames
  del df_temp
  # Use categorical dtype
  df['category'] = df['category'].astype('category')
  # Load only needed columns
  df = pd.read_csv('data.csv', usecols=['col1', 'col2'])
| Data Type | Comparison | Distribution | Relationship | Composition | Trend |
|---|---|---|---|---|---|
| Categorical | Bar, column | - | - | Pie, stacked bar | - |
| Continuous | Box plot | Histogram, KDE | Scatter, joint plot | Area chart | Line chart |
| Time Series | - | - | Line, area | Stacked area | Line chart |
| Geographic | - | - | - | Choropleth | Animated map |
| 3D | - | - | 3D scatter | 3D surface | 3D line |
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10, 6))
# Your plot code here
plt.title('Title', fontsize=14)
plt.xlabel('X Label', fontsize=12)
plt.ylabel('Y Label', fontsize=12)
plt.tight_layout()
plt.show()
import seaborn as sns
sns.set_style('whitegrid')
sns.scatterplot(data=df, x='var1', y='var2', hue='category')
plt.show()
import plotly.express as px
fig = px.scatter(df, x='var1', y='var2', color='category',
hover_data=['additional_info'])
fig.show()
import missingno as msno
# Quick overview
msno.matrix(df)
plt.show()
# Detailed analysis
print(df.isnull().sum())
print((df.isnull().sum() / len(df)) * 100)
Seaborn Built-in:
deep, muted, pastel, bright, dark, colorblind
Matplotlib Colormaps:
- Sequential: viridis, plasma, inferno, magma, cividis
- Diverging: coolwarm, RdYlBu, seismic
- Qualitative: tab10, tab20, Set1, Set2, Set3
Plotly Color Scales:
- Sequential: Blues, Greens, Reds, Viridis, Plasma
- Diverging: RdBu, PiYG, Spectral
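To preview any of the seaborn palettes listed above before committing to one, a minimal sketch:
import matplotlib.pyplot as plt
import seaborn as sns
sns.palplot(sns.color_palette('colorblind'))  # swap in any palette name above
plt.show()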
# Matplotlib
plt.savefig('plot.png', dpi=300, bbox_inches='tight')
plt.savefig('plot.svg') # Vector format
plt.savefig('plot.pdf')
# Plotly
fig.write_html('plot.html')
fig.write_image('plot.png', width=1200, height=800)  # static export requires the kaleido package
| Issue | Solution |
|---|---|
| Plot not showing | Call plt.show() or use %matplotlib inline in Jupyter |
| Overlapping labels | Use plt.tight_layout() or adjust figure size |
| Too slow (Plotly) | Reduce data points or use sampling |
| Memory error | Load data in chunks or use smaller sample |
| Font too small | Increase with fontsize parameter |
| Legend outside plot | bbox_to_anchor=(1.05, 1), loc='upper left' |
These five notebooks provide a comprehensive journey through data visualization for machine learning:
- Part 01: Foundation - Basic plots and techniques
- Part 02: Geographic - Maps and spatial data
- Part 03: Statistical - Correlations and distributions
- Part 04: Advanced - 3D and multi-dimensional
- Part 05: Quality - Missing data and preprocessing
graph LR
A[Complete Beginner] --> B[Part 01: Basics]
B --> C{Interest?}
C -->|Location Data| D[Part 02: Geographic]
C -->|Statistics| E[Part 03: Statistical]
C -->|Advanced Tech| F[Part 04: 3D]
C -->|Data Cleaning| G[Part 05: Missing Data]
D --> H[Intermediate Level]
E --> H
F --> H
G --> H
H --> I[Combine Techniques]
I --> J[Real Projects]
- Practice: Apply techniques to your own datasets
- Combine: Use multiple visualization types together
- Customize: Develop your own plotting functions
- Share: Create dashboards and reports
- Contribute: Improve these notebooks on GitHub
- Matplotlib Documentation
- Seaborn Tutorial
- Plotly Python Guide
- missingno Documentation
- Kaggle Datasets
Author: Kaggle User thecoder8890
Repository: thecoder8890/ml-visual-handbook
Last Updated: 2025
License: MIT (if applicable)
This documentation is maintained alongside the notebooks. For issues, suggestions, or contributions, please open an issue on GitHub.