<a href="https://colab.research.google.com/github/sethkipsangmutuba/Vizualization/blob/main/a2_DVRSeth1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 2: Data Foundations

## Overview
All visualizations begin with data — it may come from sensors, surveys, simulations, or computations.

Data can be raw (unprocessed) or derived (processed: e.g., smoothed, scaled).

A dataset is typically a list of n records:  
r₁, r₂, ..., rₙ, each with m variables:  
v₁, v₂, ..., vₘ.

Variables are categorized as:

- **Independent (iv):** Unaffected by others (e.g., time).
- **Dependent (dv):** Influenced by independent variables (e.g., temperature affected by location or time).

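A formal structure:  
$r_i = (iv_1, iv_2, \dots, iv_{m_i}, dv_1, dv_2, \dots, dv_{m_d})$,  
with $m = m_i + m_d$, where $m_i$ and $m_d$ are the numbers of independent and dependent variables.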

In many datasets, the independent/dependent nature may not be known in advance.

Data can also be seen as being generated by a function:

- **Domain** = independent variables  
- **Range** = dependent variables

Real-world data often contains only a subset of all possible value combinations.

---

## 2.1 Types of Data

### Two Primary Data Types:

1. **Ordinal (Numeric)**  
   - **Binary:** Only values 0 or 1  
   - **Discrete:** Integer or limited value sets (e.g., 2, 4, 6)  
   - **Continuous:** Real numbers in a range (e.g., [0, 5])  

2. **Nominal (Non-Numeric)**  
   - **Categorical:** From a small, finite list (e.g., red, green)  
   - **Ranked:** Categories with a natural order (e.g., small, medium, large)  
   - **Arbitrary:** No order, possibly infinite values (e.g., street names)  

---

### Scale of Measurement

Variables can be evaluated based on three scale attributes:

- **Ordering relation:**  
  Exists for ranked nominal and all ordinal variables.

- **Distance metric:**  
  Applicable only to ordinal data types (e.g., temperature differences).

- **Absolute zero:**  
  Present in some ordinal variables (e.g., weight), but not others (e.g., bank balance).

**Scale compatibility:**  
When mapping data to graphics, match the variable's scale with the visual encoding (e.g., size, position).

---

### Operations by Data Type

| Operation Type       | Nominal | Ranked Nominal | Ordinal |
|----------------------|---------|----------------|---------|
| Equality (=, ≠)      | ✅      | ✅             | ✅      |
| Comparison (<, >)    | ❌      | ✅             | ✅      |
| Math (+, −, ×, ÷)    | ❌      | ❌             | ✅      |


## 2.2 Structure Within and Between Records

Data records have both **syntax** (how they are represented) and **semantics** (how they relate within and across records).

---

### 2.2.1 Scalars, Vectors, and Tensors

- **Scalar:** Single numerical value (e.g., age, price).  
- **Vector:** Ordered set of scalars forming a single data item  
  - Examples:  
    - 2D displacement: [x, y]  
    - RGB color: [r, g, b]  
    - GPS: [lat, long]  
- **Tensor:** General form including scalars and vectors  
  - **Rank 0:** Scalar  
  - **Rank 1:** Vector  
  - **Rank 2+:** Matrix or higher-dimensional array  

**Example:** 3×3 matrix as a transformation tensor in 3D space.

---

### 2.2.2 Geometry and Grids

- **Explicit Geometry:** Coordinates given directly (e.g., longitude and latitude in temperature data).  
- **Implied Geometry:** Grid-based spacing; position inferred from start point and step size.  
  - Common in elevation maps and simulations.  
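
As a minimal sketch of implied geometry, the position of a sample on a uniform grid can be computed from its integer index, assuming the start point and step size are known. The names `grid_position`, `origin`, and `step`, and the numbers used, are illustrative only.

```python
def grid_position(index, origin, step):
    """Return the coordinates of a grid sample from its integer index.

    index, origin, and step are sequences with one entry per axis.
    """
    return tuple(o + i * s for i, o, s in zip(index, origin, step))


# Hypothetical elevation grid: first sample at (100.0, 200.0),
# with a spacing of 30 units along both axes.
origin = (100.0, 200.0)
step = (30.0, 30.0)

print(grid_position((0, 0), origin, step))  # (100.0, 200.0)
print(grid_position((2, 5), origin, step))  # (160.0, 350.0)
```

Only the origin and step need to be stored, which is why implied geometry is compact compared with explicit per-record coordinates.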

**Coordinate Systems:**

- Include Cartesian, spherical, hyperbolic  
- Conversion to display space typically done using transformation matrices  

**Irregular Geometry:**

- High-density near critical areas (e.g., airflow near airplane wings)  
- Requires explicit coordinates due to nonuniform spacing  
- Increases computational complexity for rendering  

---

### 2.2.3 Other Forms of Structure

- **Time (Timestamps):**  
  - Can be absolute or relative  
  - Uniform (e.g., sensor sampling) or nonuniform (e.g., business logs)  
  - Varies in scale from picoseconds to centuries  

- **Topology:**  
  - Describes connectivity among records (e.g., neighbors in a mesh, links in a network)  
  - Key for interpolation and resampling  
  - Stored as:  
    - Explicit links in the record  
    - Auxiliary structures (e.g., adjacency lists)  

---

### Examples of Structured Data

| Domain           | Structure Description                                                  |
|------------------|------------------------------------------------------------------------|
| MRI              | Scalar density + 3D spatial coords, 3D grid connectivity               |
| CFD              | Displacement (3D) + spatial + time, uniform/nonuniform grid            |
| Financial        | No geometry, multiple nominal/ordinal fields, time                     |
| CAD              | 3D spatial + surface properties + polygon/edge connectivity            |
| Remote Sensing   | Multi-channel, 2D/3D space + time + grid                               |
| Census           | Mixed data types + spatial + time + implied connectivity               |
| Social Network   | Nodes with mixed fields, connected by spatial, temporal, or group-related attributes |


## 2.3 Data Preprocessing

Raw data is preferred in many fields (e.g., medical imaging) to avoid:

- Loss of critical details  
- Introduction of misleading artifacts  

However, preprocessing may be essential depending on the data and visualization method.

Preprocessing can expose:

- Missing data  
- Outliers  
- Errors in computation or input  

---

### 2.3.1 Metadata and Statistics

#### Metadata

Provides important context for understanding and processing data:

- Field formats  
- Units of measurement  
- Reference/base values  
- Markers for missing values  
- Measurement resolution  

#### Statistical Analysis

Useful for preprocessing decisions:

- **Outlier Detection:** Identifies anomalous or faulty data  
- **Cluster Analysis:** Groups similar data points  
- **Correlation Analysis:** Detects redundancy or hidden relationships  

**Basic Statistical Measures:**

- **Mean ($\mu$):**  
  $$
  \mu = \frac{1}{n} \sum_{i=1}^{n} x_i
  $$

- **Standard Deviation ($\sigma$):**  
  $$
  \sigma = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2 }
  $$
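
The two formulas above can be checked with a short sketch. The sample values are made up, and the comparison with the standard library assumes the population form of the standard deviation (division by $n$), as written above.

```python
import math
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample

n = len(values)
mu = sum(values) / n
sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)

print(mu, sigma)  # 5.0 2.0

# statistics.pstdev divides by n, matching the formula above;
# statistics.stdev divides by n - 1 (the sample estimate) and would differ.
assert math.isclose(mu, statistics.fmean(values))
assert math.isclose(sigma, statistics.pstdev(values))
```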

#### Visual Representation

- **Histogram:** Most common plot showing data distribution


### 2.3.2 Missing Values and Data Cleansing

Real-world data often contains missing or erroneous entries due to:

- Sensor failure  
- Data entry mistakes  
- Incomplete surveys or forms  

Handling such data is critical before visualization. Below are common strategies:

---

#### 1. Discard the Bad Record

- Entire record is removed if it contains errors or missing values.  
- **Pros:** Removes uncertainty.  
- **Cons:** May result in large data loss; discarded records may hold valuable insight.  

---

#### 2. Assign a Sentinel Value

- Use a unique placeholder (e.g., -5 for a 0–100 range).  
- **Pros:** Makes problematic entries visible during visualization.  
- **Cons:** Must exclude sentinel values from analysis to avoid skewed results.  

---

#### 3. Assign the Average Value

- Replace missing values with the mean of that variable.  
- **Pros:** Simple, keeps basic statistics stable.  
- **Cons:** May hide outliers or distort true patterns in specific records.  

---

#### 4. Assign Value Based on Nearest Neighbor

- Use the value from the most similar record (based on other variables).  
- **Pros:** More tailored substitution than global average.  
- **Cons:** May not be reliable if the missing variable isn’t correlated with others.  

---

#### 5. Compute a Substitute Value (Imputation)

- Uses advanced statistical techniques to estimate missing values.  
- **Example:** Model-based imputation by Schafer.  
- **Goal:** Maximize statistical confidence in the substituted values.  

---

**Important:**  
Any substituted value should be **flagged** as such.  
Visualizations must reflect that these values are **estimates or placeholders**, not original data.
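
A minimal sketch of strategy 3 (mean substitution) combined with flagging is shown here. It assumes missing entries are marked with `None`; the function name `impute_mean` and the readings are hypothetical.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed ones.

    Returns (filled_values, flags), where flags[i] is True when
    filled_values[i] is a substitute rather than an original measurement.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to average")
    mean = sum(observed) / len(observed)
    filled = [mean if v is None else v for v in values]
    flags = [v is None for v in values]
    return filled, flags


readings = [10.0, None, 14.0, 12.0, None]  # made-up sensor readings
filled, flags = impute_mean(readings)
print(filled)  # [10.0, 12.0, 14.0, 12.0, 12.0]
print(flags)   # [False, True, False, False, True]
```

Keeping the flags next to the data lets a later plotting step draw substituted values differently, for example with a hollow marker.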


### 2.3.3 Normalization

**Purpose:**  
Normalization adjusts data values so they fit within a specific statistical or numeric range.  
This is essential for comparing variables with different units or scales and for mapping data to visual elements like color or size.

---

#### Why It Matters

- Visualizations often require input values to fall within a limited range.  
- Variables with different scales can be misleading when visualized together.  
- Ensures fair comparison across dimensions.

---

#### Types of Normalization

**Linear normalization**  
- Maps all values to a uniform range (e.g., 0 to 1).  
- Common in visual encoding to evenly distribute values.  

**Non-linear normalization**  
- Used when data is unevenly distributed.  
- Square root or logarithmic transformations spread out compressed values or compress high-value outliers.  
- Helps improve visibility of important patterns in skewed data.

**Range truncation (Bounding values)**  
- Caps extreme values beyond a chosen threshold.  
- Allows focus on the most relevant part of the data.  
- Helps reduce distortion caused by outliers.

---

#### Quantiles in Normalization

- Boundaries for truncation can be set using **quantiles**.  
- For instance, clamping values below the 1st percentile and above the 99th keeps a few extreme values from compressing the rest of the range.
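
The following sketch illustrates linear normalization, a logarithmic variant, and quantile-based bounds. The function names are illustrative, the data is made up, and the choice of the "inclusive" quantile method is an assumption made so that the bounds never fall outside the observed data.

```python
import math
import statistics

def normalize_linear(values, lo=None, hi=None):
    """Map values linearly to [0, 1]; values outside [lo, hi] are clamped."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    if hi == lo:
        return [0.0 for _ in values]
    return [min(1.0, max(0.0, (v - lo) / (hi - lo))) for v in values]

def normalize_log(values):
    """Log-transform positive values, then map the result linearly to [0, 1]."""
    return normalize_linear([math.log(v) for v in values])

def quantile_bounds(values, lower_pct=1, upper_pct=99):
    """Return the lower_pct-th and upper_pct-th percentiles of values."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return cuts[lower_pct - 1], cuts[upper_pct - 1]

# Made-up data: the integers 1..100 plus one extreme outlier.
data = list(range(1, 101)) + [10_000]
lo, hi = quantile_bounds(data)
print(lo, hi)                                   # 2.0 100.0
print(normalize_linear([1, 51, 10_000], lo, hi))  # [0.0, 0.5, 1.0]
```

Without the quantile bounds, the single outlier would push every other value into the lowest 1% of the output range.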

---

### 2.3.4 Segmentation

**Purpose:**  
Segmentation divides data into meaningful groups or categories.  
It identifies patterns, simplifies interpretation, and improves visualization.

---

#### Examples of Segmentation

- In medical imaging, different tissues (like bone or muscle) are extracted from grayscale MRI data.  
- In geographic or demographic data, regions or populations are grouped based on common attributes.

---

#### Types of Segmentation

**Simple segmentation**  
- Data is divided based on predefined value ranges.

**Probabilistic segmentation**  
- Assigns likelihoods to data points for belonging to various categories, adding nuance.
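
Simple segmentation by predefined value ranges can be sketched as below. The thresholds and labels are hypothetical; real tissue boundaries depend on the scanner and acquisition protocol.

```python
from bisect import bisect_right

def segment_by_ranges(values, thresholds, labels):
    """Assign each value the label of the range it falls in.

    thresholds must be sorted ascending, and labels needs one more entry
    than thresholds. A value equal to a threshold goes to the upper range.
    """
    if len(labels) != len(thresholds) + 1:
        raise ValueError("need exactly len(thresholds) + 1 labels")
    return [labels[bisect_right(thresholds, v)] for v in values]


intensities = [12, 80, 130, 200, 255]  # made-up grayscale values
print(segment_by_ranges(intensities, [50, 150],
                        ["background", "soft tissue", "bone"]))
# ['background', 'soft tissue', 'soft tissue', 'bone', 'bone']
```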

---

#### Common Issues

**Undersegmentation:**  
Important differences are lost because different types are grouped together.

**Oversegmentation:**  
Too many small, insignificant regions reduce clarity.

---

#### Refinement: Split-and-Merge Approach

This is an iterative method to improve segmentation:

- Similar regions are **merged** based on a similarity threshold.  
- Non-uniform regions are **split** to increase classification accuracy.  
- The process repeats until no further merging or splitting is needed.

---

#### Key Factors in Refinement

- **Similarity:** Often assessed by comparing average values between regions.  
- **Homogeneity:** Evaluated using data distribution within a region.  
- **Region Splitting:** Based on identifying internal variation and separating parts accordingly.

---

#### Avoiding Infinite Loops

Refinement must be managed carefully to prevent repeated splitting and merging of the same regions.  
This is typically handled by **gradually tightening or relaxing the thresholds**.
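
The refinement loop can be illustrated with a deliberately simplified 1D sketch. It makes several assumptions that are not in the text: regions are contiguous runs of a sequence, homogeneity is measured by the value spread (max minus min), splitting halves a region, and cycles are avoided by only merging when the result would still be homogeneous, plus a cap on the number of rounds, rather than by adjusting thresholds.

```python
from statistics import fmean

def spread(region):
    return max(region) - min(region)

def split_and_merge(values, merge_tol, split_tol, max_rounds=20):
    """Segment a 1D sequence into contiguous regions (lists of values)."""
    regions = [list(values)]
    for _ in range(max_rounds):  # cap on rounds as an extra guard against cycling
        changed = False
        # Split: halve any region that is not homogeneous.
        pieces = []
        for region in regions:
            if len(region) > 1 and spread(region) > split_tol:
                mid = len(region) // 2
                pieces += [region[:mid], region[mid:]]
                changed = True
            else:
                pieces.append(region)
        # Merge: join neighbours with similar means, but only if the
        # merged region would still count as homogeneous.
        merged = [pieces[0]]
        for region in pieces[1:]:
            similar = abs(fmean(merged[-1]) - fmean(region)) < merge_tol
            if similar and spread(merged[-1] + region) <= split_tol:
                merged[-1] = merged[-1] + region
                changed = True
            else:
                merged.append(region)
        regions = merged
        if not changed:
            break
    return regions


print(split_and_merge([1, 1, 2, 9, 9, 8, 1, 2], merge_tol=2, split_tol=3))
# [[1, 1, 2], [9, 9, 8], [1, 2]]
```

If the merge step compared only region means, the two halves of the example would be merged straight back together after the first split, which is the kind of cycle the paragraph above warns about.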


### 2.3.5 Sampling and Subsetting

---

#### Overview

Sampling and subsetting are vital data preprocessing steps used to adapt data resolution, manage large datasets, and reduce complexity in visualization tasks.

---

### 1. Sampling and Resampling

**Purpose:**  
Transform a dataset from one spatial resolution to another — commonly needed when resizing images or estimating values between sparse data points.

---

#### Interpolation

Interpolation estimates values between known data points under the assumption that the data represent a discrete sampling of a continuous phenomenon.

- **Linear Interpolation**  
  Estimates intermediate values along a straight line between two known points.  
  Simple, but the slope changes abruptly at each known point, so the result is continuous but not smooth.

- **Bilinear Interpolation**  
  Extends linear interpolation to two dimensions (e.g., image grids).  
  Useful for estimating values at fractional coordinates between grid points.

- **Nonlinear Interpolation**  
  Uses higher-order polynomial curves (like splines) for smoother transitions.  
  *Catmull-Rom* curves are often employed because they pass through known data points and provide continuity.
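
The three interpolation schemes can be sketched for scalar values as follows. This assumes uniformly spaced samples and the uniform form of the Catmull-Rom spline; the function names are illustrative.

```python
def lerp(p0, p1, t):
    """Linear interpolation between p0 and p1 for t in [0, 1]."""
    return (1 - t) * p0 + t * p1

def bilerp(grid, x, y):
    """Bilinear interpolation on grid[row][col] at fractional (x, y).

    x indexes columns and y indexes rows; both must lie inside the grid.
    """
    x0, y0 = int(x), int(y)
    x1 = min(x0 + 1, len(grid[0]) - 1)
    y1 = min(y0 + 1, len(grid) - 1)
    tx, ty = x - x0, y - y0
    top = lerp(grid[y0][x0], grid[y0][x1], tx)
    bottom = lerp(grid[y1][x0], grid[y1][x1], tx)
    return lerp(top, bottom, ty)

def catmull_rom(p0, p1, p2, p3, t):
    """Uniform Catmull-Rom spline between p1 and p2 for t in [0, 1]."""
    return 0.5 * (
        2 * p1
        + (-p0 + p2) * t
        + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t ** 2
        + (-p0 + 3 * p1 - 3 * p2 + p3) * t ** 3
    )


print(lerp(0.0, 10.0, 0.25))                    # 2.5
print(bilerp([[0, 10], [20, 30]], 0.5, 0.5))    # 15.0
print(catmull_rom(3, 1, 4, 1, 0.0))             # 1.0 (passes through p1)
print(catmull_rom(3, 1, 4, 1, 1.0))             # 4.0 (passes through p2)
```

The last two lines show the property mentioned above: the curve passes through the known points `p1` and `p2`, while `p0` and `p3` only shape the tangents.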

>  **Nonlinear methods** are preferred when smooth transitions between data points are critical, such as in image smoothing or surface modeling.

---

### 2. Subsampling and Data Reduction

When datasets are too dense or large:

- **Subsampling** reduces data volume by selecting representative points.

**Methods:**
- Regular sampling (e.g., every *n*th point)  
- Averaging neighborhoods  
- Selecting medians or random values  
- Domain-specific feature preservation
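
Two of these methods are compared in the sketch below on a made-up signal with a narrow spike. The function names are illustrative.

```python
def subsample_regular(values, n):
    """Keep every n-th value, starting with the first."""
    return values[::n]

def subsample_average(values, n):
    """Replace each run of up to n consecutive values with its mean."""
    chunks = [values[i:i + n] for i in range(0, len(values), n)]
    return [sum(chunk) / len(chunk) for chunk in chunks]


signal = [0, 0, 0, 9, 0, 0, 0, 0]  # a narrow spike at index 3
print(subsample_regular(signal, 2))  # [0, 0, 0, 0]
print(subsample_average(signal, 2))  # [0.0, 4.5, 0.0, 0.0]
```

Regular sampling drops the spike entirely, whereas neighborhood averaging keeps an attenuated trace of it.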

>  **Caution:** Naïve subsampling may miss important features like small objects or sharp boundaries.

---

### 3. Data Subsetting

**Purpose:** Focus analysis or visualization on relevant data portions.

#### Query-based Subsetting
- Filters data using conditions (e.g., time ranges, thresholds)  
- Efficient — avoids loading entire datasets  
- Suitable for structured or database-driven tasks

#### Interactive Subsetting
- User visually selects portions of data (e.g., brushing, highlighting)  
- More flexible and exploratory  
- Allows intuitive filtering during visual exploration

---

### Post-Subset Actions

- **Delete:** Remove irrelevant data  
- **Mask:** Hide everything except the selected subset  
- **Highlight:** Emphasize chosen parts for comparison
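
A query-based subset and the corresponding post-subset actions can be sketched with plain Python records. The field names, the query, and the data are hypothetical.

```python
def in_subset(record):
    """Hypothetical query: January readings below freezing."""
    return record["month"] == 1 and record["temp_c"] < 0.0


records = [
    {"city": "A", "month": 1, "temp_c": -4.0},
    {"city": "B", "month": 1, "temp_c": 6.5},
    {"city": "C", "month": 7, "temp_c": 24.0},
    {"city": "D", "month": 1, "temp_c": -1.5},
]

# Delete: keep only the records that satisfy the query.
kept = [r for r in records if in_subset(r)]

# Mask or highlight: keep every record, but attach a selection flag
# that a renderer can use to hide or emphasize it.
flags = [in_subset(r) for r in records]

print([r["city"] for r in kept])  # ['A', 'D']
print(flags)                      # [True, False, False, True]
```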

---

### Key Takeaways

- Sampling and interpolation are crucial for handling sparse or differently scaled data.  
- Subsetting simplifies large datasets for analysis without overwhelming the user or the visualization system.  
- A **balance between accuracy and efficiency** is needed to maintain data fidelity while reducing complexity.


### 2.3.6 Dimension Reduction

When data has many dimensions — more than a visualization technique or human cognition can effectively handle — we use **dimension reduction** to simplify the data while preserving as much information as possible.

This process helps highlight **patterns, clusters, and outliers** without needing to display every variable.

---

#### Manual and Automated Techniques

- **Manual Selection**: The user chooses which variables to retain based on domain knowledge.

- **Automated Methods**:

  - **Principal Component Analysis (PCA)**  
    A linear technique that creates new axes (*principal components*) as combinations of the original dimensions.  
    These new axes are ranked by how much **variance** they explain in the data.  
    📖 [More on PCA](https://en.wikipedia.org/wiki/Principal_component_analysis)

  - **Multidimensional Scaling (MDS)**  
    Places data in a lower-dimensional space to match high-dimensional distances.  
    📖 [More on MDS](https://en.wikipedia.org/wiki/Multidimensional_scaling)

  - **Kohonen Self-Organizing Maps (SOMs)**  
    Unsupervised neural networks that map high-dimensional data to a 2D grid while preserving topology.  
    📖 [More on SOMs](https://en.wikipedia.org/wiki/Self-organizing_map)

  - **Locally Linear Embedding (LLE)**  
    A nonlinear method that preserves local structure during reduction.  
    📖 [More on LLE](https://en.wikipedia.org/wiki/Locally_linear_embedding)

---

#### PCA: How It Works

1. Subtract the mean from each data dimension.  
2. Compute the **covariance matrix**.  
3. Calculate **eigenvectors** and **eigenvalues**.  
4. Sort eigenvectors by descending eigenvalue magnitude.  
5. Choose top $k$ eigenvectors as new axes.  
6. Project the data onto these axes.
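
The six steps can be sketched with NumPy. The use of NumPy, the function name `pca`, and the synthetic two-variable data are assumptions made for illustration; this is not the Iris example.

```python
import numpy as np

def pca(data, k):
    """Project data (n_samples x n_features) onto its top-k principal components.

    Returns (projected, components, eigenvalues); components are stored as columns.
    """
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)        # 1. subtract the mean
    cov = np.cov(centered, rowvar=False)       # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # 3. eigenvalues and eigenvectors
    order = np.argsort(eigvals)[::-1]          # 4. sort by descending eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :k]                # 5. keep the top k eigenvectors
    projected = centered @ components          # 6. project onto the new axes
    return projected, components, eigvals


# Synthetic data: the second variable is roughly twice the first.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
points = np.column_stack([t, 2 * t + rng.normal(scale=0.1, size=200)])

projected, components, eigvals = pca(points, k=1)
print(eigvals / eigvals.sum())  # fraction of variance explained per component
```

`np.linalg.eigh` is used because the covariance matrix is symmetric; it returns eigenvalues in ascending order, hence the explicit re-sort in step 4.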

> PCA is widely used for efficient linear transformation.  
> Example: The **Iris dataset** often demonstrates class clustering in 2D after PCA.

---

#### MDS: How It Works

1. Calculate all pairwise distances in the high-dimensional space.  
2. Randomly initialize 2D/3D point positions.  
3. Iteratively adjust positions to minimize **stress** (the mismatch between distances in the low-dimensional layout and the original high-dimensional distances).  
4. Stop when improvement is minimal.
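
One way to realize these steps is plain gradient descent on the raw stress (the sum of squared distance differences). The sketch below assumes NumPy, a fixed step size `lr`, and a stopping tolerance `tol`; these are illustrative choices, and other update rules are possible.

```python
from itertools import product

import numpy as np

def pairwise_distances(points):
    """Euclidean distance between every pair of rows."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def stress(layout, target):
    """Sum over pairs of squared (layout distance - target distance)."""
    return ((pairwise_distances(layout) - target) ** 2).sum() / 2

def mds(data, dim=2, lr=0.01, max_steps=2000, tol=1e-9, seed=0):
    data = np.asarray(data, dtype=float)
    target = pairwise_distances(data)             # 1. high-dimensional distances
    rng = np.random.default_rng(seed)
    layout = rng.normal(size=(len(data), dim))    # 2. random initial positions
    history = [stress(layout, target)]
    for _ in range(max_steps):                    # 3. adjust positions to reduce stress
        diff = layout[:, None, :] - layout[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))
        np.fill_diagonal(dist, 1.0)               # avoid 0/0 on the diagonal
        coeff = (dist - target) / dist
        np.fill_diagonal(coeff, 0.0)              # a point exerts no force on itself
        grad = 2 * (coeff[:, :, None] * diff).sum(axis=1)
        layout = layout - lr * grad
        history.append(stress(layout, target))
        if history[-2] - history[-1] < tol:       # 4. stop when improvement is minimal
            break
    return layout, history


# The eight corners of a unit cube, flattened from 3D to 2D.
cube = np.array(list(product([0.0, 1.0], repeat=3)))
layout, history = mds(cube)
print(history[0], history[-1])  # stress before and after optimization
```

Changing `seed` gives a different starting layout and possibly a different final stress, which is the sensitivity to initialization that applies to this kind of iterative scheme.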

> ⚠️ MDS is **sensitive to initialization** and may need repeated runs or enhancements like [simulated annealing](https://en.wikipedia.org/wiki/Simulated_annealing) to avoid local minima.

---

### 2.3.7 Mapping Nominal Dimensions to Numbers

Nominal variables (e.g., car brands, model names) lack inherent order.  
**Mapping them to numbers or coordinates must avoid implying false relationships.**

#### Strategies

- **Direct Labeling**  
  Best for small datasets but can become cluttered.

- **Symbols or Colors**  
  Use distinct markers or hues without implying numerical order.

- **Statistical Similarity**  
  Use numeric attributes to group similar nominal values.  
  Then apply methods like **MDS** to assign positions.

- **Correspondence Analysis**  
  A technique like PCA, but for **categorical data**.  
  📖 [More on Correspondence Analysis](https://en.wikipedia.org/wiki/Correspondence_analysis)

---

### 2.3.8 Aggregation and Summarization

Large or dense datasets benefit from **aggregation**, which simplifies data by grouping similar records and summarizing them.

---

#### Key Concepts

- **Clustering**:  
  Group points by similarity using:

  - Bottom-up (agglomerative) merging  
  - Top-down (divisive) partitioning  
  - Split-and-merge methods  
  📖 [More on clustering](https://en.wikipedia.org/wiki/Cluster_analysis)

- **Summary Statistics**:  
  Compute measures like means, variances, and counts per group.
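
Per-group summary statistics can be sketched with the standard library once each record carries a group label. The labels and values here are made up, and the population variance is an arbitrary choice for the example.

```python
from collections import defaultdict
from statistics import fmean, pvariance

def summarize(records):
    """Group (label, value) pairs by label and summarize each group."""
    groups = defaultdict(list)
    for label, value in records:
        groups[label].append(value)
    return {
        label: {
            "count": len(vals),
            "mean": fmean(vals),
            "variance": pvariance(vals),
        }
        for label, vals in groups.items()
    }


records = [("a", 1.0), ("a", 3.0), ("b", 5.0)]  # made-up (cluster, value) pairs
summary = summarize(records)
print(summary["a"])  # {'count': 2, 'mean': 2.0, 'variance': 1.0}
print(summary["b"])  # {'count': 1, 'mean': 5.0, 'variance': 0.0}
```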

---

#### Visualization of Aggregates

- Show **summary shapes or symbols** for each cluster  
- Allow **drill-down** to explore internal variation  
- Avoid **over-summarizing**, which can hide meaningful patterns

>  Example: In the **Iris dataset**, parallel coordinate plots with aggregation help spot dominant clusters, while still allowing detailed exploration.


### 2.3.9 Smoothing and Filtering

**Purpose**  
- Reduce noise and blur sharp changes in data.  
- Common in **signal** and **time-series** processing.

---

#### Convolution (Weighted Averaging)

Smooths data using values from neighboring points.

**1D Example:**

$$
p_i' = \frac{1}{4}p_{i-1} + \frac{1}{2}p_i + \frac{1}{4}p_{i+1}
$$

Used for sliding-window smoothing. The weights define the **filter kernel**.
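
A minimal sketch of the 1D filter above follows. The boundary handling (repeating the first and last values) is an assumption, since the formula does not say what happens at the ends of the sequence.

```python
def smooth(values, kernel=(0.25, 0.5, 0.25)):
    """Apply a symmetric, odd-length weighted-average kernel to values.

    Edge samples are handled by repeating the first and last values,
    which is one of several possible boundary conventions.
    """
    half = len(kernel) // 2
    padded = [values[0]] * half + list(values) + [values[-1]] * half
    return [
        sum(w * padded[i + j] for j, w in enumerate(kernel))
        for i in range(len(values))
    ]


noisy = [0.0, 0.0, 4.0, 0.0, 0.0]  # a single made-up spike
print(smooth(noisy))  # [0.0, 1.0, 2.0, 1.0, 0.0]
```

The spike is halved and spread over its neighbors, while the sum of the values is preserved because the kernel weights add up to 1.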

---

#### Exponential Smoothing

Gives more weight to **recent values** than older ones.

**Formula:**

$$
s_0 = x_0,\quad s_t = \alpha x_t + (1 - \alpha)s_{t-1} \quad (t > 0)
$$

Where:  
- $x_t$ is the observed value at time $t$  
- $s_t$ is the smoothed value  
- $\alpha$ is the **smoothing factor**, $0 < \alpha < 1$
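
A short sketch of exponential smoothing follows, assuming the common recurrence in which each smoothed value blends the current observation with the previous smoothed value, $s_t = \alpha x_t + (1 - \alpha)s_{t-1}$, starting from $s_0 = x_0$. The function name and data are illustrative.

```python
def exponential_smoothing(xs, alpha):
    """Return s with s[0] = xs[0] and s[t] = alpha*xs[t] + (1 - alpha)*s[t-1]."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    smoothed = [xs[0]]
    for x in xs[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


print(exponential_smoothing([10.0, 20.0, 20.0, 20.0], alpha=0.5))
# [10.0, 15.0, 17.5, 18.75]
```

A larger `alpha` tracks the observations more closely; a smaller one smooths more strongly but reacts more slowly to a change in level.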

---

 **Further Reading**  
- [Smoothing (Wikipedia)](https://en.wikipedia.org/wiki/Smoothing)  
- [Exponential Smoothing](https://en.wikipedia.org/wiki/Exponential_smoothing)  
- [Convolution](https://en.wikipedia.org/wiki/Convolution)

---

### 2.3.10 Raster-to-Vector Conversion

 **Purpose**  
Transform raster images (pixel-based) into vector graphics (points, lines, shapes).

---

 **Why Convert?**

- More compact and scalable representation  
- Easier **geometric transformation** (e.g., rotation, scaling)  
- Better for **analysis**, **comparison**, and **modeling**

---

 **Techniques**

- **Thresholding**  
  Segment the image by applying intensity cutoffs.  
  📖 [Thresholding (Image Processing)](https://en.wikipedia.org/wiki/Thresholding_(image_processing))

- **Region Growing**  
  Expand from seed pixels by adding neighboring pixels with similar values.  
  📖 [Region Growing](https://en.wikipedia.org/wiki/Region_growing)

- **Boundary Detection (Edge Detection)**  
  Detects edges using filters like **Sobel** or **Canny**.  
  📖 [Edge Detection](https://en.wikipedia.org/wiki/Edge_detection)  
  📖 [Kernel in Image Processing](https://en.wikipedia.org/wiki/Kernel_(image_processing))

- **Thinning (Skeletonization)**  
  Reduces thick structures to 1-pixel-wide centerlines.  
  📖 [Topological Skeleton](https://en.wikipedia.org/wiki/Topological_skeleton)
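
Thresholding and region growing can be sketched on a tiny made-up image. The 4-neighbor connectivity and the rule of comparing each pixel with the seed value are assumptions; other variants compare with the running region mean.

```python
from collections import deque

def threshold(image, cutoff):
    """Binary mask: True where the pixel value is at least cutoff."""
    return [[value >= cutoff for value in row] for row in image]

def region_grow(image, seed, tolerance):
    """Collect 4-connected pixels whose value is within tolerance of the seed's."""
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    region, queue = {seed}, deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in region
                    and abs(image[nr][nc] - seed_value) <= tolerance):
                region.add((nr, nc))
                queue.append((nr, nc))
    return region


image = [
    [10, 12, 90, 95],
    [11, 13, 92, 10],
    [10, 11, 12, 11],
]
print(threshold(image, 50)[0])                  # [False, False, True, True]
print(sorted(region_grow(image, (0, 2), 10)))   # [(0, 2), (0, 3), (1, 2)]
```

The resulting pixel set is still raster data; tracing its boundary into line segments or polygons would be the next step toward a vector representation.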

---

**Related Topics**

- [Vectorization (Image Tracing)](https://en.wikipedia.org/wiki/Vectorization_(image_tracing))  
- [Image Processing Overview](https://en.wikipedia.org/wiki/Image_processing)


### 2.3.11 Summary of Data Preprocessing

Data preprocessing steps such as smoothing, normalization, or transformation improve visualization quality and facilitate discovery — but they also **modify the raw data**.

---

 **Why it matters:**

- Users should be informed of any preprocessing to avoid misinterpretation.
- Lack of transparency can lead to **false conclusions** (see Chapter 13).

---

## 2.4 Data Sets Used in This Book

A wide variety of datasets are used to demonstrate visualizations throughout the book. Most are available either on the book’s companion site or from open data repositories.

---

 **Public Data Portals**

- [U.S. Open Data](https://www.data.gov)  
- [European Union Open Data](https://open-data.europa.eu)

---

 **Sample Data Sets**

- **DJIA** (`djia-100.xls`)  
   100+ years of Dow Jones daily closings  
   [Source](https://www.analyzeindices.com/dow-jones-history.shtml)

- **Colorado Elevation** (`colorado_elev.vit`)  
   Elevation grid of Colorado  
   [Source](https://opendx.org)

- **UVW Flow Field** (`uvw.dat`)  
   3D vector field of turbulent flow  
   Courtesy: Drs. Jiacai Lu and Gretar Tryggvason  
   [Source](https://www.me.wpi.edu/Tryggvason)

- **U.S. City Temperatures** (`city_temp.xls`)  
   Average January temperatures for 56 U.S. cities  
   [Source](https://lib.stat.cmu.edu/datasets/city-temp)

- **CT Head** (`CThead.zip`)  
   3D medical scan (256×256×113)  
   [Source](https://graphics.stanford.edu/data/voldata/)

- **Miscellaneous Excel Files** (`cars.xls`, `detroit.xls`, `cereal.xls`)  
   Multivariate, nonspatial data sets  
   [Source](https://lib.stat.cmu.edu)

- **Health-Related Data**  
   UNICEF indicators, CDC obesity rates  
   [Source](https://www.openindicators.org)

- **VAST Challenge Data**  
   Synthetic, complex multivariate data with embedded truth  
   [Source](https://hcil.umd.edu/localphp/hcil/vast/archive/index.php)

- **U.S. County Census**  
   Age, race, household composition  
   [Raw Data](https://www.census.gov)  
   [Cleaned Version](https://www.openindicators.org/data)

- **Iris Data** (`iris.csv`)  
   Classic flower measurement dataset  
   [Source](https://archive.ics.uci.edu/ml/datasets/Iris)

---

## 2.5 Related Readings

 **Additional references for deepening understanding:**

- Preprocessing theory: “Data Preprocessing”, in [352]
- Dimension reduction: PCA, manifold models ([53])

---

 **File Format Encyclopedias**

- Brown et al. for image formats  
- Older archive: [Wotsit](https://www.wotsit.org)

---

 **Online Data Repositories**

- [StatLib (CMU)](https://lib.stat.cmu.edu)  
- [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)  
- [NOAA Climate Data](https://www.ncei.noaa.gov)  
- [VAST Archives](https://hcil.umd.edu/localphp/hcil/vast/archive/index.php)

---

## 2.6 Exercises

Engage students with reflective and applied thinking:

- Identify datasets with or without:
  - Ordering  
  - Distance metric  
  - Absolute zero

- Distinguish between an **attribute** and a **value**, with examples.

- Compare missing data strategies:
  - Delete rows  
  - Use placeholders like -999  
  - Mean substitution  
  - Nearest-neighbor fill

- Search public data repositories and analyze their types and structure.

- Extract data from newspapers — by section — and design potential datasets.

- Identify 10+ sources of everyday data (e.g., nutrition labels, phone step counter).

- Download recent temperatures and apply **smoothing (convolution)**.

---

## 2.7 Projects

Projects for deeper, hands-on application:

- **Resample 3D scalar field**  
  📥 Input: dimensions (height, width, depth)  
  ➡️ Output: resampled volume

- **Bin categorization strategies**  
  - Uniform bin width  
  - Uniform bin count  
  - Gap-based splits

- **Normalization Program**
  - Normalize to range $[0, 1]$  
  - Normalize to mean $0$ and standard deviation $1$  
  - Map to integer range $[0, 255]$

- **Missing Value Imputation (Schafer’s Model)**
  - Use R: [https://www.r-project.org](https://www.r-project.org)  
  - Compare with your own implementation

- **PCA Implementation on Iris Data**
  - Identify principal components  
  - Visualize the data projected onto the leading components
